ON SURROGATE DIMENSION REDUCTION FOR
MEASUREMENT ERROR REGRESSION: AN INVARIANCE LAW 

By Bing Li and Xiangrong Yin

Pennsylvania State University and University of Georgia 

We consider a general nonlinear regression problem where the 
predictors contain measurement error. It has been recently discovered 
that several well-known dimension reduction methods, such as OLS, 
SIR and pHd, can be performed on the surrogate regression problem 
to produce consistent estimates for the original regression problem 
involving the unobserved true predictor. In this paper we establish a 
general invariance law between the surrogate and the original dimen- 
sion reduction spaces, which implies that, at least at the population 
level, the two dimension reduction problems are in fact equivalent. 
Consequently we can apply all existing dimension reduction meth- 
ods to measurement error regression problems. The equivalence holds 
exactly for multivariate normal predictors, and approximately for ar- 
bitrary predictors. We also characterize the rate of convergence for 
the surrogate dimension reduction estimators. Finally, we apply sev- 
eral dimension reduction methods to real and simulated data sets 
involving measurement error to compare their performances. 

1. Introduction. We consider dimension reduction for regressions in which
the predictor contains measurement error. Let X be a p-dimensional random
vector representing the true predictor and Y be a random variable
representing the response. In many applications we cannot measure X (e.g.,
blood pressure) accurately, but instead observe a surrogate r-dimensional
predictor W that is related to X through the linear equation

(1) $W = \gamma + \Gamma^T X + \delta$,

where $\gamma$ is an r-dimensional nonrandom vector, $\Gamma$ is a p by r nonrandom
matrix and $\delta$ is an r-dimensional random vector independent of (X,Y). The
goal of the regression analysis is to find the relation between the response Y
and the true, but unobserved, predictor X. This type of regression problem 
frequently occurs in practice and has been the subject of extensive stud- 
ies, including, for example, those that deal with linear models (Fuller [15]), 
generalized linear models (Carroll [2], Carroll and Stefanski [5]), nonlinear 
models (Carroll, Ruppert and Stefanski [4]) and nonparametric models (see 
Pepe and Fleming [23]). 

Typically there is an auxiliary sample which provides information about 
the relation between the original predictor X and the surrogate predictor 
W, for example, by allowing us to estimate $\Sigma_{WX} = \mathrm{cov}(W, X)$. Using this
covariance estimate we can adjust the surrogate predictor W to align it as
much as possible with the true predictor X. At the population level this is
realized by regressing X on W, that is, adjusting W to $U = \Sigma_{XW}\Sigma_W^{-1} W$,
where $\Sigma_{XW} = \mathrm{cov}(X, W)$ and $\Sigma_W = \mathrm{var}(W)$. The fundamental question that
will be answered in this paper is this: If we perform a dimension reduction
operation on the surrogate regression problem of Y versus U, will the result
correctly reflect the relation between Y and the true predictor X?

In the classical setting where the true predictor X is observed, the di- 
mension reduction problem can be briefly outlined as follows. Suppose that 
Y depends on X only through a lower dimensional vector of linear combinations
of X, say $\beta^T X$, where $\beta$ is a p by d matrix with $d \le p$. Or more
precisely, suppose that Y is independent of X conditioning on $\beta^T X$, which
will be denoted by

(2) $Y \perp\!\!\!\perp X \mid \beta^T X$.

The goal of dimension reduction is to estimate the directions of column
vectors of $\beta$, or the column space of $\beta$. Note that the above relation will not
be affected if $\beta$ is replaced by $\beta A$ for any nonsingular $d \times d$ matrix A. This
is why the column space of $\beta$, rather than $\beta$ itself, is the object of interest in
dimension reduction. A dimension reduction space provides us with a set of
important predictors among all the linear combinations of X, with which we 
could perform exploratory data analysis or finer regression analysis without 
having to fit a nonparametric regression over a large number of predictors. 
Classical estimators of the dimension reduction space include ordinary least
squares (OLS) (Li and Duan [21], Duan and Li [13]), sliced inverse regression
(SIR) (Li [19]), principal Hessian directions (pHd) (Li [20]) and the sliced
average variance estimator (SAVE) (Cook and Weisberg [11]).

It has been discovered that some of these dimension reduction methods 
can be performed on the adjusted surrogate predictor U to produce consistent
estimates of at least some vectors in the column space of $\beta$ in (2) that
describes the relation between Y and the (unobserved) true predictor X.
The first paper in this area is Carroll and Li [3], which demonstrated this
phenomenon for OLS and SIR, and introduced the corresponding estimators
of $\beta$ in the measurement error context. More recently, Lue [22] established
that the pHd method, when applied to the surrogate problem (U,Y), also
yields consistent estimators of vectors in the column space of $\beta$. This work
opens up the possibility of using available dimension reduction techniques
to estimate $\beta$ by simply pretending U is the true predictor X.

In this paper we will establish a general equivalence between the dimension
reduction problem of Y versus U and that of Y versus X. That is,

(3) $Y \perp\!\!\!\perp X \mid \beta^T X$ if and only if $Y \perp\!\!\!\perp U \mid \beta^T U$.

This means that dimension reduction for the surrogate regression problem 
of Y versus U and that for the original regression problem of Y versus X are 
in fact equivalent at the population level. Thus the phenomena discovered 
by the above work are special cases of a very general invariance pattern — 
we can, in fact, apply any consistent dimension reduction method to the 
surrogate regression problem of Y versus U to produce consistent dimension 
reduction estimates for the original regression problem of Y versus X. This 
fundamental relation is of practical importance, because OLS, SIR and pHd 
have some well-known limitations. For example, SIR does not perform well 
when the regression surface is symmetric about the origin, and pHd does not 
perform well when the regression surface lacks a clear quadratic pattern (or 
what is similar to it). New methods have recently been developed that can, 
in different respects and to varying degrees, remedy these shortcomings; see, 
for example, Cook and Li [9, 10], Xia et al. [25], Fung et al. [16], Yin and 
Cook [27, 28] and Li, Zha and Chiaromonte [18]. This equivalence allows us 
to choose among the broader class of dimension reduction methods to tackle 
the difficult situations in which the classical methods become inaccurate. 

Sometimes the main purpose of the regression analysis is to infer the
conditional mean $E(Y|X)$ or more generally conditional moments such as
$E(Y^k|X)$. For example, in generalized linear models we are mainly interested
in estimating the conditional mean $E(Y|X)$, and for regression with
heteroscedasticity we may be interested in both the conditional mean $E(Y|X)$
and the conditional variance $\mathrm{var}(Y|X)$. In these cases it is sensible to treat
the conditional moments such as $E(Y|X)$ and $\mathrm{var}(Y|X)$ as the objects of
interest and the rest of the conditional distribution $f(Y|X)$ as the (infinite
dimensional) nuisance parameter, and reformulate the dimension reduction
problem to reflect this hierarchy. This was carried out in Cook and Li [9] and
Yin and Cook [26], which introduced the notions of the central mean space
and central moment space as well as methods to estimate them. If there
is a p by d matrix $\beta$ with $d \le p$ such that $E(Y|X) = E(Y|\beta^T X)$, then we
call the column space of $\beta$ a dimension reduction space for the conditional
mean $E(Y|X)$. More generally, the dimension reduction space for the kth
conditional moment $E(Y^k|X)$ is defined as above with Y replaced by $Y^k$.
In this paper we will also establish the equivalence between the dimension
reduction spaces for the kth conditional moments of the surrogate and the
original regressions. That is,

(4) $E(Y^k|X) = E(Y^k|\beta^T X)$ if and only if $E(Y^k|U) = E(Y^k|\beta^T U)$.

The above invariance relations will be shown to hold exactly under the
assumption that X and $\delta$ are multivariate normal; a similar assumption
was also used in Carroll and Li [3] and Lue [22]. For arbitrary predictor and
measurement error, we will establish an approximate invariance relation. This
is based on the fact that, when p is modestly large, most projections of a 
random vector are approximately normal (Diaconis and Freedman [12], Hall 
and Li [17]). Simulation studies indicate that the approximate invariance 
law holds for surprisingly small p (as small as 6) and for severely nonnormal 
predictors. 

This paper will be focused on the dimension reduction problems defined 
through relationships such as $Y \perp\!\!\!\perp X \mid \beta^T X$. A more general problem can be
formulated as $Y \perp\!\!\!\perp X \mid t(X)$, where $t(X)$ is a (possibly nonlinear) function of
X; see Cook [8]. Surrogate dimension reduction in this general sense is not 
covered by this paper, and remains an important open problem. 

In Section 2 we introduce some basic issues and concepts related to mea- 
surement error problems and dimension reduction, as well as some machinery 
that will be repeatedly used in our further exposition. Equivalence (3) will 
be established in Section 3 for the case where $\Gamma$ in (1) is a p by p nonsingular
matrix. Equivalence (3) for general $\Gamma$ will be shown in Section 4. In
Section 5 we will establish equivalence (4). The approximate equivalence for 
general predictors and measurement errors will be developed in Section 6. 
In Section 7 we will turn our attention to a general estimation procedure 
for surrogate dimension reduction and study its convergence rate. In Section 
8 we conduct a simulation study to compare different surrogate dimension 
reduction methods. In Section 9 we apply the invariance law to analyze a 
managerial behavior data set (Fuller [15]) that involves measurement errors. 
Some technical results will be proved in the Appendix. 

2. Preliminaries. In this section we lay out some basic concepts and
notation. For a pair of random vectors $V_1$ and $V_2$, we will use $\Sigma_{V_1 V_2}$ to denote
the covariance matrix $\mathrm{cov}(V_1, V_2)$, and for a random vector V, we will use
$\Sigma_V$ to denote the variance matrix $\mathrm{var}(V)$. If $V_1$ and $V_2$ are independent,
then we write $V_1 \perp\!\!\!\perp V_2$; if they are independent conditioning on a third
random element $V_3$, then we write $V_1 \perp\!\!\!\perp V_2 \mid V_3$. If a matrix $\Sigma$ is positive
definite, then we write $\Sigma > 0$. For a matrix A, the space spanned by its
columns will be denoted by span(A). If a matrix A has columns $a_1, \ldots, a_p$,
then vec(A) denotes the vector $(a_1^T, \ldots, a_p^T)^T$. If A, B, C are matrices, then
$\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$, where $\otimes$ denotes the tensor product between
matrices.
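
The vec identity above can be checked numerically. The following is a minimal sketch, assuming NumPy and arbitrary matrix sizes chosen only for illustration; the helper `vec` is a hypothetical name that stacks columns, matching the definition of vec(A) given above, and `np.kron` plays the role of the tensor product.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into a single vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)
# Illustrative sizes only; any conformable A, B, C would do.
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 2))

lhs = vec(A @ B @ C)                 # vec(ABC)
rhs = np.kron(C.T, A) @ vec(B)       # (C^T tensor A) vec(B)
identity_holds = np.allclose(lhs, rhs)
print(identity_holds)
```

Column-major stacking matters here: with row-major flattening the roles of A and C in the Kronecker product would be interchanged.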

In a measurement error problem we observe a primary sample on (W, Y), 
which allows us to study the relation between Y and W, and an auxiliary 
sample that allows us to estimate $\Sigma_{WX}$, thus relating the surrogate predictor
to the true predictor. The auxiliary sample can be available under one of 
several scenarios in practice, which will be discussed in detail in Section 
7. We will first (through Sections 3 to 6) focus on developments at the 
population level, and for this purpose it suffices to assume a matrix such as 
$\Sigma_{XW}$ is known, keeping in mind that it is to be obtained externally to the
primary sample — either from the auxiliary data or from prior information. 

Because X is not observed, we use $\Sigma_{XW}$ to adjust the surrogate predictor
W to make it stochastically as close to X as possible. As will soon be clear we
can assume $E(X) = E(W) = 0$. In this case we adjust W to $U = \Sigma_{XW}\Sigma_W^{-1} W$
(see, e.g., Carroll and Li [3]). Note that if W is multivariate normal this is
just the conditional expectation $E(X|W)$. Thus U is the measurable function
of W closest to X in terms of $L_2$ distance. If W is not multivariate normal,
then U can simply be interpreted as linear regression of X on W.
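
As an illustration of this adjustment, the sketch below forms $U = \Sigma_{XW}\Sigma_W^{-1} W$ for simulated data. It assumes the model $W = \Gamma^T X + \delta$ with X independent of $\delta$ and both centered; the particular values of `Sigma_X`, `Gamma` and `Sigma_delta` are hypothetical and chosen only for illustration. It checks that X − U is uncorrelated with W at the population level, which is the defining property of the linear regression of X on W.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 4, 2000

# Hypothetical population quantities, for illustration only.
Sigma_X = np.eye(p) + 0.3 * np.ones((p, p))               # var(X), positive definite
Gamma = np.eye(p) + 0.5 * (np.ones((p, p)) - np.eye(p))   # p x p matrix in W = Gamma^T X + delta
Sigma_delta = 0.25 * np.eye(p)                            # var(delta)

# Moments implied by the model when X and delta are independent.
Sigma_XW = Sigma_X @ Gamma                                # cov(X, W)
Sigma_W = Gamma.T @ Sigma_X @ Gamma + Sigma_delta         # var(W)
adjust = Sigma_XW @ np.linalg.inv(Sigma_W)                # U = adjust @ W

# Population check: cov(X - U, W) = Sigma_XW - adjust @ Sigma_W = 0.
cov_resid_W = Sigma_XW - adjust @ Sigma_W

# Sample version: rows of X, W, U are observations.
X = rng.multivariate_normal(np.zeros(p), Sigma_X, size=n)
delta = rng.multivariate_normal(np.zeros(p), Sigma_delta, size=n)
W = X @ Gamma + delta            # row i is (Gamma^T X_i + delta_i)^T
U = W @ adjust.T                 # row i is (adjust @ W_i)^T

mse_U = np.mean(np.sum((X - U) ** 2, axis=1))   # squared L2 distance from X to U
mse_W = np.mean(np.sum((X - W) ** 2, axis=1))   # squared L2 distance from X to raw W
print(mse_U, mse_W)
```

In practice $\Sigma_{XW}$ would be estimated from the auxiliary sample rather than computed from known population quantities as is done here.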

3. Invariance of surrogate dimension reduction. Recall that if there is a
p by d matrix $\beta$, with $d \le p$, such that (2) holds, then we call the column
space of $\beta$ a dimension reduction space. See Li [19, 20] and Cook [6, 7]. Under
very mild conditions, such as given in Cook [7], Section 6, the intersection of
all dimension reduction spaces is again a dimension reduction space, which
is then called the central space and is written $\mathcal{S}_{Y|X}$. We will denote the
dimension of $\mathcal{S}_{Y|X}$ by q. Note that $q \le d$ for any $\beta$ satisfying (2). Similarly, we
will denote the central space of Y versus U as $\mathcal{S}_{Y|U}$ and call it the surrogate
central space. Our interests lie, of course, in the estimation of $\mathcal{S}_{Y|X}$, but
$\mathcal{S}_{Y|U}$ is all that we can infer from the data. In this section we will establish
the invariance law

(5) $\mathcal{S}_{Y|U} = \mathcal{S}_{Y|X}$

in the situation where $\Gamma$ is a p by p nonsingular matrix and X and $\delta$ are
multivariate normal.

We can assume without loss of generality that $E(X) = 0$ and $E(U) = 0$
because, for any p-dimensional vector a, $\mathcal{S}_{Y|X} = \mathcal{S}_{Y|(X-a)}$ and
$\mathcal{S}_{Y|U} = \mathcal{S}_{Y|(U-a)}$. Since we will always assume $E(\delta) = 0$, $E(X) = E(U) = 0$ implies
that $\gamma = 0$, and the measurement error model (1) reduces to

(6) $W = \Gamma^T X + \delta$.

The next lemma (and its variation) is the key to the whole development 
in this paper. It is also a fundamental fact about multivariate normal distri- 
butions that has been previously unknown. It will be applied to both exact 
and asymptotic distributions. 
Lemma 3.1. Let $U_1^*, V_1^*$ be r-dimensional and $U_2^*, V_2^*$ be s-dimensional
random vectors with r + s = p. Let

$U^* = (U_1^{*T}, U_2^{*T})^T$, $V^* = (V_1^{*T}, V_2^{*T})^T$,

and let Y be a random variable. Suppose:

1. $U_1^*, U_2^*, V_1^*, V_2^*$ are multivariate normal.

2. $U^* - V^* \perp\!\!\!\perp (V^*, Y)$.

Then:

1. If there is an r-dimensional multivariate normal random vector $V_3^*$ such
that $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$, and if $U_1^* \perp\!\!\!\perp U_2^*$, then
$Y \perp\!\!\!\perp V^* \mid V_3^*$ implies $Y \perp\!\!\!\perp U^* \mid U_1^*$.

2. If there is an r-dimensional multivariate normal random vector $U_3^*$ such
that $U_3^* \perp\!\!\!\perp (U_1^* - U_3^*, U_2^*)$, and if $V_1^* \perp\!\!\!\perp V_2^*$, then
$Y \perp\!\!\!\perp U^* \mid U_3^*$ implies $Y \perp\!\!\!\perp V^* \mid V_1^*$.

We should emphasize that despite its appearance the lemma is not symmetric
for $U^*$ and $V^*$ because of assumption 2; note that we do not assume
$V^* - U^* \perp\!\!\!\perp (U^*, Y)$. This is why the second assertion, though similar to the
first, is not redundant. This asymmetry is intrinsic to the measurement error
problem, where U is a diffusion of X but not conversely.

Proof of Lemma 3.1. Write $U^*$ as $V^* + (U^* - V^*)$, and we have

(7) $E(e^{it^T U^*} \mid Y) = E(e^{it^T V^*} e^{it^T (U^* - V^*)} \mid Y)$.

By assumption 2 we have $(U^* - V^*) \perp\!\!\!\perp V^* \mid Y$ and $U^* - V^* \perp\!\!\!\perp Y$. Hence the
right-hand side reduces to

(8) $E(e^{it^T V^*} \mid Y)\, E(e^{it^T (U^* - V^*)} \mid Y) = E(e^{it^T V^*} \mid Y)\, E(e^{it^T (U^* - V^*)})$.

Assumption 2 also implies that $\Sigma_{U^*} = \mathrm{var}(U^* - V^*) + \Sigma_{V^*}$ and hence that
$U^* - V^* \sim N(0, \Sigma_{U^*} - \Sigma_{V^*})$. Thus the right-hand side of (8) further reduces
to

$E(e^{it^T V^*} \mid Y)\, e^{-(1/2) t^T (\Sigma_{U^*} - \Sigma_{V^*}) t}$.

Substitute this into the right-hand side of (7) to obtain

(9) $E(e^{it^T U^*} \mid Y)\, e^{(1/2) t^T \Sigma_{U^*} t} = E(e^{it^T V^*} \mid Y)\, e^{(1/2) t^T \Sigma_{V^*} t}$.

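Now suppose there is a $V_3^*$ such that $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$ and
$Y \perp\!\!\!\perp (V_1^*, V_2^*) \mid V_3^*$. The latter independence implies
$Y \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*) \mid V_3^*$ which, combined with the former independence, yields

$(V_1^* - V_3^*, V_2^*) \perp\!\!\!\perp (V_3^*, Y) \Rightarrow (V_1^* - V_3^*, V_2^*) \perp\!\!\!\perp V_3^* \mid Y$

and

$(V_1^* - V_3^*, V_2^*) \perp\!\!\!\perp Y$.

Hence, writing $t = (t_1^T, t_2^T)^T$ with $t_1$ of dimension r,

(10) $E(e^{it^T V^*} \mid Y) = E(e^{i t_1^T (V_1^* - V_3^*) + i t_2^T V_2^*} e^{i t_1^T V_3^*} \mid Y) = E(e^{i t_1^T (V_1^* - V_3^*) + i t_2^T V_2^*})\, E(e^{i t_1^T V_3^*} \mid Y)$.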

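Now let $W^* = ((V_1^* - V_3^*)^T, V_2^{*T})^T$. Because $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$ we have

(11) $\Sigma_{W^*} = \mathrm{var}\begin{pmatrix} V_1^* - V_3^* \\ V_2^* \end{pmatrix} = \begin{pmatrix} -\Sigma_{V_3^*} & 0 \\ 0 & 0 \end{pmatrix} + \Sigma_{V^*}$.

In the meantime, because $W^*$ is multivariate normal we have

(12) $E(e^{i t_1^T (V_1^* - V_3^*) + i t_2^T V_2^*}) = e^{-(1/2) t^T \Sigma_{W^*} t}$.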

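Now combine (9) through (12) to obtain

$E(e^{it^T U^*} \mid Y)\, e^{(1/2) t^T \Sigma_{U^*} t} = E(e^{i t_1^T V_3^*} \mid Y)\, e^{(1/2) t_1^T \Sigma_{V_3^*} t_1}$.

Consequently the left-hand side does not depend on $t_2$; that is, we can take
$t_2 = 0$ without changing it:

$E(e^{it^T U^*} \mid Y)\, e^{(1/2) t^T \Sigma_{U^*} t} = E(e^{i t_1^T U_1^*} \mid Y)\, e^{(1/2) t_1^T \Sigma_{U_1^*} t_1}$.

This is equivalent to

(13) $E(e^{it^T U^*} \mid Y) = E(e^{i t_1^T U_1^*} \mid Y)\, e^{-(1/2) t^T \Sigma_{U^*} t + (1/2) t_1^T \Sigma_{U_1^*} t_1} = E(e^{i t_1^T U_1^*} \mid Y)\, e^{-(1/2) t_2^T \Sigma_{U_2^*} t_2}$,

where the second equality follows from the assumption $U_1^* \perp\!\!\!\perp U_2^*$. Multiply
both sides by $e^{i\tau Y}$ and then take the expectation to obtain

$E(e^{i\tau Y + it^T U^*}) = E[e^{i\tau Y} E(e^{i t_1^T U_1^*} \mid Y)]\, e^{-(1/2) t_2^T \Sigma_{U_2^*} t_2} = E(e^{i\tau Y + i t_1^T U_1^*})\, e^{-(1/2) t_2^T \Sigma_{U_2^*} t_2}$,

from which it follows that $(Y, U_1^*) \perp\!\!\!\perp U_2^*$, which implies that $Y \perp\!\!\!\perp U^* \mid U_1^*$.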

The second assertion can be proved similarly. Following the same argument
that leads to (10), we have

(14) $E(e^{it^T U^*} \mid Y) = E(e^{i t_1^T U_3^*} \mid Y)\, e^{-(1/2) t_1^T (\Sigma_{U_1^*} - \Sigma_{U_3^*}) t_1 - (1/2) t_2^T \Sigma_{U_2^*} t_2}$.

Now combine this relation with (9) and follow the proof of the first assertion
to complete the proof. □

We are now ready to establish the invariance relation (5). 

Theorem 3.1. Suppose:

1. $X \sim N(\mu_X, \Sigma_X)$, where $\Sigma_X > 0$.

2. $\delta \perp\!\!\!\perp (X, Y)$ and $\delta \sim N(0, \Sigma_\delta)$, where $\Sigma_\delta > 0$.

Then $\mathcal{S}_{Y|U} = \mathcal{S}_{Y|X}$.

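Proof. Assume, without loss of generality, that $E(X) = 0$ and $E(U) = 0$.
We first show that $\mathcal{S}_{Y|U} \subseteq \mathcal{S}_{Y|X}$. Denote the dimension of $\mathcal{S}_{Y|X}$ by q and
let $\beta$ be a p by q matrix whose columns form a basis of $\mathcal{S}_{Y|X}$. Let $\zeta$ be a p
by $p - q$ matrix such that $\zeta^T \Sigma_U \beta = 0$ and such that the matrix $\eta = (\beta, \zeta)$ is
full rank. Let

$V_1^* = \beta^T \Sigma_U \Sigma_X^{-1} X$, $V_2^* = \zeta^T \Sigma_U \Sigma_X^{-1} X$, $V_3^* = (\beta^T \Sigma_U \beta)(\beta^T \Sigma_X \beta)^{-1} \beta^T X$,
$U_1^* = \beta^T U$, $U_2^* = \zeta^T U$.

Then

$\mathrm{cov}(U_1^*, U_2^*) = \beta^T \Sigma_U \zeta = 0$,

$\mathrm{cov}(V_2^*, V_3^*) = (\zeta^T \Sigma_U \beta)(\beta^T \Sigma_X \beta)^{-1}(\beta^T \Sigma_U \beta) = 0$,

$\mathrm{cov}(V_3^*, V_1^* - V_3^*) = (\beta^T \Sigma_U \beta)(\beta^T \Sigma_X \beta)^{-1}(\beta^T \Sigma_U \beta) - (\beta^T \Sigma_U \beta)(\beta^T \Sigma_X \beta)^{-1}(\beta^T \Sigma_U \beta) = 0$.

It follows that $U_1^* \perp\!\!\!\perp U_2^*$ and $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$. In the meantime, by
definition,

$U^* - V^* = \eta^T (U - \Sigma_U \Sigma_X^{-1} X)$.

However, recall that

(15) $U = \Sigma_{XW}\Sigma_W^{-1}\Gamma^T X + \Sigma_{XW}\Sigma_W^{-1}\delta = \Sigma_{XW}\Sigma_W^{-1}\Sigma_{WX}\Sigma_X^{-1} X + \Sigma_{XW}\Sigma_W^{-1}\delta = \Sigma_U \Sigma_X^{-1} X + \Sigma_{XW}\Sigma_W^{-1}\delta$,

where the second equality holds because $\Sigma_{WX} = \Gamma^T \Sigma_X$, which follows from
the independence $X \perp\!\!\!\perp \delta$ and the definition of W; the third equality holds
because $\Sigma_U = \Sigma_{XW}\Sigma_W^{-1}\Sigma_{WX}$. Hence $U^* - V^* = \eta^T \Sigma_{XW}\Sigma_W^{-1}\delta$, which is
independent of $(V^*, Y)$. Finally, we note that $V_3^*$ is a one-to-one function of
$\beta^T X$ and $V^*$ is a function of X. So $Y \perp\!\!\!\perp V^* \mid V_3^*$. Thus, by the first assertion
of Lemma 3.1, we have $U^* \perp\!\!\!\perp Y \mid U_1^* \Rightarrow U \perp\!\!\!\perp Y \mid \beta^T U \Rightarrow \mathcal{S}_{Y|U} \subseteq \mathcal{S}_{Y|X}$.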

To prove $\mathcal{S}_{Y|X} \subseteq \mathcal{S}_{Y|U}$, let $\beta$ be a matrix whose columns are a basis of
$\mathcal{S}_{Y|U}$, and $\zeta$ be such that the columns of $(\beta, \zeta)$ are a basis of $\mathbb{R}^p$ and
$\zeta^T \Sigma_X \beta = 0$. Let

$U_3^* = (\beta^T \Sigma_X \beta)(\beta^T \Sigma_U \beta)^{-1} \beta^T U$,
$V_1^* = \beta^T X$, $V_2^* = \zeta^T X$.

Now follow the proof of $\mathcal{S}_{Y|U} \subseteq \mathcal{S}_{Y|X}$, but this time apply the second assertion
of Lemma 3.1, to complete the proof. □
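
The decomposition (15), which splits U into a part driven by X and a part driven by the measurement error, is a purely algebraic consequence of $\Sigma_{WX} = \Gamma^T \Sigma_X$ and can be checked numerically. The sketch below does this for randomly generated inputs; the matrices `Sigma_X`, `Gamma`, `Sigma_delta` and the vectors `x`, `d` are hypothetical values used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 5

# Hypothetical inputs, for illustration only.
L = rng.standard_normal((p, p))
Sigma_X = L @ L.T + p * np.eye(p)                        # var(X), positive definite
Gamma = rng.standard_normal((p, p)) + 3.0 * np.eye(p)    # p x p matrix in W = Gamma^T X + delta
Sigma_delta = np.diag(rng.uniform(0.1, 1.0, size=p))     # var(delta), positive definite

Sigma_WX = Gamma.T @ Sigma_X                             # cov(W, X), since X is independent of delta
Sigma_XW = Sigma_WX.T
Sigma_W = Gamma.T @ Sigma_X @ Gamma + Sigma_delta        # var(W)
K = Sigma_XW @ np.linalg.inv(Sigma_W)                    # U = K W
Sigma_U = K @ Sigma_WX                                   # var(U) = Sigma_XW Sigma_W^{-1} Sigma_WX

# Coefficient of X in (15): Sigma_U Sigma_X^{-1}, which should equal K Gamma^T.
signal_coef = Sigma_U @ np.linalg.inv(Sigma_X)

# Check (15) for one arbitrary realization (x, d) of (X, delta).
x = rng.standard_normal(p)
d = rng.standard_normal(p)
u_direct = K @ (Gamma.T @ x + d)                         # U computed from W
u_split = signal_coef @ x + K @ d                        # right-hand side of (15)
print(np.allclose(u_direct, u_split))
```

The second term, `K @ d`, depends on $\delta$ only, which is why $U^* - V^*$ in the proof is independent of $(V^*, Y)$.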

The assumptions made in Theorem 3.1 are roughly equivalent to those
made in Lue [22], Theorem 1, though our dimension reduction assumption,
$Y \perp\!\!\!\perp X \mid \beta^T X$, is weaker than the corresponding assumption therein, which
is $Y = g(\beta^T X, \varepsilon)$ where $\varepsilon \perp\!\!\!\perp X$; it is easy to show that the latter implies the
former. For example, if $Y = g(\beta^T X, \varepsilon)$ where $\varepsilon \perp\!\!\!\perp X \mid \beta^T X$, then $Y \perp\!\!\!\perp X \mid \beta^T X$
still holds but $\varepsilon$ is no longer independent of X. Except for this dimension
reduction assumption, our assumptions are stronger than those made in 
Carroll and Li [3], Theorem 2.1. However, our conclusion is stronger than 
those in both of these papers, in that it reveals the intrinsic invariance 
relation between dimension reduction spaces, not limited to any specific 
dimension reduction methods. 

In the next example, we will give a visual demonstration of the invariance 
law. 

Example 3.1. Let p = r = 6, $X \sim N(0, I_p)$, $\varepsilon \sim N(0, \sigma_\varepsilon^2)$, $\delta \sim N(0, \sigma_\delta^2 I_p)$
and $\delta \perp\!\!\!\perp (X, \varepsilon)$. Consider the measurement-error regression model

(16) $Y = 0.4(\beta_1^T X)^2 + 3\sin(\beta_2^T X/4) + \sigma_\varepsilon \varepsilon$, $W = \Gamma^T X + \delta$,

where $\beta_1 = (1, 1, 1, 0, 0, 0)^T$ and $\beta_2 = (1, 0, 0, 0, 1, 3)^T$, and $\Gamma$ is a $p \times p$ matrix
with diagonal elements equal to 1 and off-diagonal elements equal to 0.5.
We take $\sigma_\varepsilon = 0.2$, $\sigma_\delta = 1/6$, and generate $(X_i, Y_i, W_i)$, i = 1, ..., 400, from
this model.

In Figure 1 the left panels are the scatter plots of Y versus $\beta_1^T X$ (upper)
and $\beta_2^T X$ (lower) from a sample of 400 observed (X,Y). The 3D shape of Y
versus $\beta^T X$ is roughly a U-shaped surface tilted upward in an orthogonal
direction. The right panels are the scatter plots for Y versus $\beta_1^T U$ and $\beta_2^T U$.
As predicted by the invariance law, the directions $\beta_1$ and $\beta_2$, which are in
$\mathcal{S}_{Y|X}$, also capture most of the variation of Y in its relation to U, although
the variation for the surrogate problem is larger than that for the original
problem.
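
The data-generating step of Example 3.1 can be sketched as follows. This is an illustrative simulation, not the code used to produce Figure 1. It assumes that the population moments are used to form U (with $\Sigma_X = I_p$, so that $\Sigma_{XW} = \Gamma$ and $\Sigma_W = \Gamma^T\Gamma + \sigma_\delta^2 I_p$), and it reads (16) literally, drawing $\varepsilon$ with standard deviation $\sigma_\varepsilon$ and then scaling it by $\sigma_\varepsilon$ again. The random seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 6, 400
sigma_eps, sigma_delta = 0.2, 1.0 / 6.0
beta1 = np.array([1, 1, 1, 0, 0, 0], dtype=float)
beta2 = np.array([1, 0, 0, 0, 1, 3], dtype=float)
Gamma = np.full((p, p), 0.5) + 0.5 * np.eye(p)    # diagonal 1, off-diagonal 0.5

# Generate (X_i, Y_i, W_i); rows are observations.
X = rng.standard_normal((n, p))
eps = sigma_eps * rng.standard_normal(n)          # literal reading: eps ~ N(0, sigma_eps^2)
delta = sigma_delta * rng.standard_normal((n, p))
Y = 0.4 * (X @ beta1) ** 2 + 3.0 * np.sin(X @ beta2 / 4.0) + sigma_eps * eps
W = X @ Gamma + delta                             # row i is (Gamma^T X_i + delta_i)^T

# Adjusted surrogate predictor U = Sigma_XW Sigma_W^{-1} W from population moments.
Sigma_XW = Gamma                                  # Sigma_X Gamma with Sigma_X = I
Sigma_W = Gamma.T @ Gamma + sigma_delta**2 * np.eye(p)
U = W @ np.linalg.solve(Sigma_W, Sigma_XW.T)      # row i is (Sigma_XW Sigma_W^{-1} W_i)^T

# How closely the surrogate directions track the true ones.
corr1 = np.corrcoef(X @ beta1, U @ beta1)[0, 1]
corr2 = np.corrcoef(X @ beta2, U @ beta2)[0, 1]
print(corr1, corr2)
```

Plotting `Y` against `X @ beta1`, `X @ beta2`, `U @ beta1` and `U @ beta2` would give four panels arranged as described for Figure 1.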

4. The invariance law for arbitrary $\Gamma$. We now turn to the general case
where $\Gamma$ is a $p \times r$ matrix. We will assume $r \le p$, which makes sense because
otherwise there will be redundancy in the surrogate predictor W. In this
case W is of dimension r, but the adjusted surrogate predictor U still has
dimension p, with a singular variance matrix $\Sigma_U$ if r < p. We will assume
that the column space of $\Gamma$ contains the dimension reduction space for Y|X
(which always holds if $\Gamma$ is a nonsingular square matrix). This is a very
natural assumption: it means that we can have measurement error, but
this error cannot be so severe as to erase part of the true regression
parameter.

Fig. 1. Original and surrogate dimension reduction spaces. Left panels: Y versus $\beta_1^T X$
(upper) and Y versus $\beta_2^T X$ (lower). Right panels: Y versus $\beta_1^T U$ (upper) and Y versus
$\beta_2^T U$ (lower).

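Theorem 4.1. Suppose that $\Gamma$ in (1) is a p by r matrix with $r \le p$, and
that $\Gamma$ has rank r. Suppose that $\delta \sim N(0, \Sigma_\delta)$ with $\Sigma_\delta > 0$, $X \sim N(\mu_X, \Sigma_X)$
with $\Sigma_X > 0$, and $\delta \perp\!\!\!\perp (X, Y)$. Furthermore, suppose that $\mathcal{S}_{Y|X} \subseteq \mathrm{span}(\Gamma)$.
Then $\mathcal{S}_{Y|U} = \mathcal{S}_{Y|X}$.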



Proof. First we note that

(17) $Y \perp\!\!\!\perp X \mid \Gamma^T X$ and $Y \perp\!\!\!\perp U \mid \Gamma^T U$.

The first relation follows directly from the assumption $\mathrm{span}(\beta) \subseteq \mathrm{span}(\Gamma)$.
To prove the second relation, let $P_\Gamma(\Sigma_X) = \Gamma(\Gamma^T \Sigma_X \Gamma)^{-1}\Gamma^T \Sigma_X$ be the
projection onto $\mathrm{span}(\Gamma)$ in terms of the inner product $\langle a, b \rangle = a^T \Sigma_X b$. Then

$\mathrm{var}[(I - P_\Gamma(\Sigma_X))^T U] = [I - P_\Gamma(\Sigma_X)]^T \Sigma_U [I - P_\Gamma(\Sigma_X)]$.

However, we note that

$[I - P_\Gamma(\Sigma_X)]^T \Sigma_U = (I - \Sigma_X \Gamma(\Gamma^T \Sigma_X \Gamma)^{-1}\Gamma^T)\Sigma_X \Gamma \Sigma_W^{-1}\Gamma^T \Sigma_X = 0$.

Thus $\mathrm{var}[(I - P_\Gamma(\Sigma_X))^T U] = 0$, which implies $U = P_\Gamma^T(\Sigma_X) U$. That is, U
and $\Gamma^T U$ in fact generate the same $\sigma$-field, and hence the second relation in
(17) must hold.

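Next, by the definition of U we have

$U = \Sigma_X \Gamma \Sigma_W^{-1}(\Gamma^T X + \delta)$.

Multiply both sides of this equation from the left by $\Gamma^T$, to obtain

$\Gamma^T U = \Gamma^T \Sigma_X \Gamma \Sigma_W^{-1}(\Gamma^T X + \delta)$.

Let $\tilde U = \Gamma^T U$ and $\tilde X = \Gamma^T X$. Then $\Sigma_{\tilde X W} = \Gamma^T \Sigma_X \Gamma$ and
$\Sigma_{\tilde U} = \Gamma^T \Sigma_U \Gamma$. In this new coordinate system the above equation can be rewritten as

$\tilde U = \Sigma_{\tilde X W}\Sigma_W^{-1}(\tilde X + \delta)$.

Because (i) $\tilde X$ has a multivariate normal distribution with $\Sigma_{\tilde X} = \Gamma^T \Sigma_X \Gamma > 0$
and (ii) $\delta \perp\!\!\!\perp (\tilde X, Y)$ and $\delta \sim N(0, \Sigma_\delta)$ with $\Sigma_\delta > 0$, we have, by Theorem 3.1,

$\mathcal{S}_{Y|\tilde U} = \mathcal{S}_{Y|\tilde X}$.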

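Now let q be the dimension of $\mathcal{S}_{Y|X}$ and suppose that $\beta$ is a p by q
matrix whose columns form a basis of $\mathcal{S}_{Y|X}$. We note that $q \le r$. Because
$\mathrm{span}(\beta) \subseteq \mathrm{span}(\Gamma)$, there is an r by q matrix $\eta$ of rank q such that $\beta = \Gamma\eta$.
The following string of implications is evident:

$Y \perp\!\!\!\perp X \mid \beta^T X \Rightarrow Y \perp\!\!\!\perp X \mid \eta^T \Gamma^T X \Rightarrow Y \perp\!\!\!\perp X \mid \eta^T \tilde X \Rightarrow Y \perp\!\!\!\perp \Gamma^T X \mid \eta^T \tilde X \Rightarrow Y \perp\!\!\!\perp \tilde X \mid \eta^T \tilde X$.

This means $\mathrm{span}(\eta)$ is a dimension reduction space for the problem $Y|\tilde X$,
and hence, because $\mathcal{S}_{Y|\tilde X} = \mathcal{S}_{Y|\tilde U}$, it must also be a dimension reduction
space for the problem $Y|\tilde U$. It follows that $Y \perp\!\!\!\perp \tilde U \mid \eta^T \tilde U$ or equivalently

(18) $Y \perp\!\!\!\perp \Gamma^T U \mid \eta^T \Gamma^T U$.

In the meantime, because $\Gamma^T U$ and $(\Gamma^T U, \eta^T \Gamma^T U)$ generate the same $\sigma$-field,
the second relation in (17) implies

(19) $Y \perp\!\!\!\perp U \mid (\Gamma^T U, \eta^T \Gamma^T U)$.

By Proposition 4.6 of Cook [7], relations (18) and (19) combined imply that

$Y \perp\!\!\!\perp (U, \Gamma^T U) \mid \eta^T \Gamma^T U \Rightarrow Y \perp\!\!\!\perp U \mid \eta^T \Gamma^T U \Rightarrow Y \perp\!\!\!\perp U \mid \beta^T U$,

from which it follows that $\mathcal{S}_{Y|U} \subseteq \mathcal{S}_{Y|X}$.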

To show the reverse inclusion $\mathcal{S}_{Y|X} \subseteq \mathcal{S}_{Y|U}$, let s be the dimension of $\mathcal{S}_{Y|U}$
and $\xi$ be a p by s matrix whose columns form a basis of $\mathcal{S}_{Y|U}$. By the second
conditional independence in (17) we have $\mathrm{span}(\xi) \subseteq \mathrm{span}(\Gamma)$. Hence $s \le q$,
and there is an r by s matrix $\zeta$ of rank s such that $\xi = \Gamma\zeta$. Follow the proof
of the first inclusion to show that

$Y \perp\!\!\!\perp \Gamma^T X \mid \zeta^T \Gamma^T X$.

In the meantime, because $\Gamma^T X$ and $(\Gamma^T X, \zeta^T \Gamma^T X)$ generate the same
$\sigma$-field, the first conditional independence in (17) implies that

$Y \perp\!\!\!\perp X \mid (\Gamma^T X, \zeta^T \Gamma^T X)$.

Now follow the proof of the first inclusion. □
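
The key step in the proof of the second relation in (17), namely that $[I - P_\Gamma(\Sigma_X)]^T \Sigma_U = 0$ so that U is a function of $\Gamma^T U$, can be illustrated numerically for r < p. The sketch below uses hypothetical matrices `Sigma_X`, `Gamma` and `Sigma_delta` generated at random; it also confirms that $\Sigma_U$ is singular with rank r in this case.

```python
import numpy as np

rng = np.random.default_rng(4)
p, r = 6, 3

# Hypothetical inputs, for illustration only.
L = rng.standard_normal((p, p))
Sigma_X = L @ L.T + np.eye(p)              # var(X), positive definite
Gamma = rng.standard_normal((p, r))        # p x r, rank r with probability one
Sigma_delta = 0.5 * np.eye(r)              # var(delta), positive definite

Sigma_W = Gamma.T @ Sigma_X @ Gamma + Sigma_delta
Sigma_U = Sigma_X @ Gamma @ np.linalg.inv(Sigma_W) @ Gamma.T @ Sigma_X

# P_Gamma(Sigma_X): projection onto span(Gamma) under <a, b> = a^T Sigma_X b.
P = Gamma @ np.linalg.inv(Gamma.T @ Sigma_X @ Gamma) @ Gamma.T @ Sigma_X

residual = (np.eye(p) - P).T @ Sigma_U     # should vanish, so var[(I - P)^T U] = 0
rank_Sigma_U = np.linalg.matrix_rank(Sigma_U)
print(np.abs(residual).max(), rank_Sigma_U)
```

Because the residual vanishes, $U = P_\Gamma^T(\Sigma_X) U = \Sigma_X \Gamma(\Gamma^T \Sigma_X \Gamma)^{-1}\Gamma^T U$ almost surely, which exhibits U as a linear function of $\Gamma^T U$.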

5. Invariance of surrogate dimension reduction for conditional moments.

We now establish the invariance law between the central mean (or moment)
spaces of the surrogate and the original dimension reduction problem. As
briefly discussed in the Introduction, if there is a p by d matrix $\beta$ with $d \le p$
such that for k = 1, 2, ...,

(20) $E(Y^k|X) = E(Y^k|\beta^T X)$,

then we call the column space of $\beta$ a dimension reduction space for the kth
conditional moment. Similar to the previous case, the intersection of all such
spaces again satisfies (20) under mild conditions. We call this intersection
the kth central moment space, and denote it by $\mathcal{S}_{E(Y^k|X)}$. Let $\mathcal{S}_{E(Y^k|U)}$ be
the kth central moment space for Y versus U. The goal of this section is to
establish the invariance relation

(21) $\mathcal{S}_{E(Y^k|X)} = \mathcal{S}_{E(Y^k|U)}$.

The next lemma parallels Lemma 3.1. Its proof will be given in the Appendix. 

Lemma 5.1. Let $U_1^*, U_2^*, V_1^*, V_2^*, Y$ be as defined in Lemma 3.1 and suppose
assumptions 1 and 2 therein are satisfied. Let h(Y) be an integrable
function of Y. Then:

1. If there is an r-dimensional multivariate normal random vector $V_3^*$
such that $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$, and if $U_1^* \perp\!\!\!\perp U_2^*$, then
$E[h(Y) \mid V^*, V_3^*] = E[h(Y) \mid V_3^*]$ implies $E[h(Y) \mid U^*] = E[h(Y) \mid U_1^*]$.

2. If there is an r-dimensional multivariate normal random vector $U_3^*$
such that $U_3^* \perp\!\!\!\perp (U_1^* - U_3^*, U_2^*)$, and if $V_1^* \perp\!\!\!\perp V_2^*$, then
$E[h(Y) \mid U^*, U_3^*] = E[h(Y) \mid U_3^*]$ implies $E[h(Y) \mid V^*] = E[h(Y) \mid V_1^*]$.
The next theorem establishes the invariance law (21).

Theorem 5.1. Suppose k is any positive integer and:

1. $X \sim N(\mu_X, \Sigma_X)$, where $\Sigma_X > 0$,

2. $\delta \perp\!\!\!\perp (X, Y)$ and $\delta \sim N(0, \Sigma_\delta)$, where $\Sigma_\delta > 0$,

3. $E(|Y|^k) < \infty$.

Then $\mathcal{S}_{E(Y^k|U)} = \mathcal{S}_{E(Y^k|X)}$.

The proof is similar to that of Theorem 3.1; the only difference is now we 
use Lemma 5.1 instead of Lemma 3.1. Evidently, we can follow the same steps 
in Section 4 to show that the assertion of Theorem 5.1 holds for general $\Gamma$.
We state this generalization as the following corollary. The proof is omitted. 

Corollary 5.1. Suppose that $\Gamma$ in (1) is a p by r matrix with $r \le p$,
and that $\Gamma$ has rank r. Suppose that $\delta$ and X are multivariate normal
with $\Sigma_X > 0$, $\Sigma_\delta > 0$, $E(\delta) = 0$ and $\delta \perp\!\!\!\perp (X, Y)$. Suppose that $E(|Y|^k) < \infty$.
Furthermore, suppose that $\mathcal{S}_{E(Y^k|X)} \subseteq \mathrm{span}(\Gamma)$. Then $\mathcal{S}_{E(Y^k|U)} = \mathcal{S}_{E(Y^k|X)}$.

6. Approximate invariance for non-Gaussian predictors. In this section 
we establish an approximate invariance law for arbitrary predictors. This 
is based on the fundamental result that, when the dimension p is reason- 
ably large, low-dimensional projections of the predictor are approximately 
multivariate normal. See Diaconis and Freedman [12] and Hall and Li [17]. 
Although this is a limiting behavior as $p \to \infty$, from our experience the
approximate normality manifests for surprisingly small p. For example, a
1-dimensional projection of a 10-dimensional uniform distribution is virtually
indistinguishable from a normal distribution. Thus the multivariate normality
holds approximately in a wide range of applications. Intuitively, if we combine the
exact invariance under normality, as we developed in the previous sections, 
and the approximate normality when p is large, then we will arrive at an ap- 
proximate invariance law for large p. This section is devoted to establishing 
this intuition as a fact. 
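
The claim about low-dimensional projections can be illustrated by a small simulation. The sketch below assumes that the "10-dimensional uniform distribution" has independent coordinates, each uniform on an interval scaled to unit variance; this is one possible reading, and the sample size and seed are arbitrary. It compares the excess kurtosis of a random one-dimensional projection with the value 0 of a normal law. For independent coordinates the exact excess kurtosis of $\beta^T X$ is $-1.2 \sum_i \beta_i^4$, which is close to 0 unless $\beta$ concentrates on a few coordinates.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n = 10, 200_000

# A direction drawn uniformly from the unit sphere in R^p.
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)

# Assumed predictor law: i.i.d. coordinates, uniform on [-sqrt(3), sqrt(3)]
# (mean 0, variance 1).
X = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, p))
Z = X @ beta                                   # one-dimensional projection beta^T X

# Sample excess kurtosis of the projection (0 for a normal distribution).
excess_kurtosis = np.mean(Z**4) / np.mean(Z**2) ** 2 - 3.0
# Exact excess kurtosis for this beta under the assumed law.
theory = -1.2 * np.sum(beta**4)
print(excess_kurtosis, theory)
```

A single coordinate of X has excess kurtosis −1.2, so the shrinkage toward 0 seen here is entirely due to the projection averaging over coordinates.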

We rewrite quantities such as $X, U, \delta, \beta$ in the previous sections as $X_p, U_p, \delta_p,
\beta_p$. Let $\mathbb{S}^p$ denote the unit sphere in $\mathbb{R}^p$: $\{x \in \mathbb{R}^p : \|x\| = 1\}$, and $\mathrm{Unif}(\mathbb{S}^p)$
denote the uniform distribution on $\mathbb{S}^p$. The result of Diaconis and Freedman
[12] states that, if $\beta_p \sim \mathrm{Unif}(\mathbb{S}^p)$, then, under mild conditions the conditional
distribution of $\beta_p^T X_p \mid \beta_p$ converges weakly in probability (w.i.p.) to
normal as $p \to \infty$. That is, the sequence of conditional characteristic functions
$E(e^{it\beta_p^T X_p} \mid \beta_p)$ converges (pointwise) in probability to a normal
characteristic function. Intuitively, this means when p is large, the distribution
of $\beta_p^T X_p$ is nearly normal for most $\beta_p$'s. Here, the parameter $\beta_p$ is treated
as random to facilitate the notion of "most of $\beta_p$." We will adopt this
assumption. In Diaconis and Freedman's development the $X_p$ is treated as
nonrandom, but in our case $(\delta_p, X_p, Y)$ is random. In this context it makes
sense to assume $\beta_p \perp\!\!\!\perp (Y, X_p, \delta_p)$, which would have been the case if the
data $(Y, X_p, \delta_p)$ were treated as fixed. With $\beta_p$ being random, the dimension
reduction relation should be stated as $Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \beta_p)$.
Our goal is to establish that, if $Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \beta_p)$, then, in an approximate
sense $Y \perp\!\!\!\perp U_p \mid (\beta_p^T U_p, \beta_p)$, and vice versa. To do so we need to define an
approximate version of conditional independence. Recall that, in the classical
context when p is fixed and $\beta$ is nonrandom, $Y \perp\!\!\!\perp U \mid \beta^T X$ if and only
if $Y \perp\!\!\!\perp t^T U \mid \beta^T X$ for all $t \in \mathbb{R}^p$, as can be easily shown using characteristic
functions. The definition of approximate conditional independence is analogous
to the second statement.

Definition 6.1. Let $\beta_p$ be a $p \times d$ dimensional random matrix whose
columns are i.i.d. $\mathrm{Unif}(\mathbb{S}^p)$ and $\beta_p \perp\!\!\!\perp (U_p, Y)$. We say that Y and $U_p$ are
asymptotically conditionally independent given $(\beta_p^T U_p, \beta_p)$, in terms of weak
convergence in probability if, for any random vector $\zeta_p$ satisfying $\zeta_p \sim \mathrm{Unif}(\mathbb{S}^p)$
and $\zeta_p \perp\!\!\!\perp (\beta_p, U_p, Y)$, the sequence $(Y, \beta_p^T U_p, \zeta_p^T U_p) \mid (\beta_p, \zeta_p)$ converges w.i.p.
to $(Y^*, U^*, V^*)$ in which $Y^* \perp\!\!\!\perp V^* \mid U^*$. If this holds we write

$Y \perp\!\!\!\perp U_p \mid (\beta_p^T U_p, \beta_p)$ w.i.p. as $p \to \infty$.

The following lemma gives further results regarding w.i.p. convergence 
that will be used in the later development. Its proof will be given in the 
Appendix. 

Lemma 6.1. Let $\{R_p\}$, $\{S_p\}$, $\{T_p\}$ and $\{\beta_p\}$ be sequences of random
vectors in which the first three have dimensions not dependent on p. Then:

1. Let $R^*$ be a random vector with the same dimension as $R_p$ and denote
by $\phi_p(t; \beta_p)$ and $\omega(t)$ the characteristic functions $E(e^{it^T R_p} \mid \beta_p)$ and
$E(e^{it^T R^*})$, respectively. Then $R_p \mid \beta_p \to R^*$ w.i.p. if and only if, for each
t of the same dimension as $R_p$,

(22) $E\phi_p(t; \beta_p) \to \omega(t)$, $E|\phi_p(t; \beta_p)|^2 \to |\omega(t)|^2$.

2. If $(R_p, S_p, T_p) \mid \beta_p \to (R^*, S^*, T^*)$ w.i.p. and $R_p \perp\!\!\!\perp S_p \mid (T_p, \beta_p)$ for all p,
then $R^* \perp\!\!\!\perp S^* \mid T^*$.

Expression (22) is used as a sufficient condition for w.i.p. convergence in
Diaconis and Freedman [12]; here we use it as a sufficient and necessary
condition. In the next lemma, $\|\cdot\|_F$ will denote the Frobenius norm. Let $M_p$
be the random matrix $(U_p, X_p, \Sigma_{U_p}\Sigma_{X_p}^{-1} X_p)$ and $\tilde M_p$ be an independent copy
of $M_p$.






Lemma 6.2. If $\|\Sigma_{X_p}\|_F^2 = o(p^2)$, $\|\Sigma_{U_p}\|_F^2 = o(p^2)$ and $\|\Sigma_{U_p}\Sigma_{X_p}^{-1}\Sigma_{U_p}\|_F^2 =
o(p^2)$, then $p^{-1} M_p^T \tilde M_p = o_P(1)$.

This will be proved in the Appendix. The convergence $p^{-1} M_p^T \tilde M_p = o_P(1)$
was used by Diaconis and Freedman [12], as one of the two main assumptions
[assumption (1.2)] in their development, but here we push this assumption
back onto the structure of the covariance matrices. (More precisely, they
used a parallel version of the convergence because they treat the data as a
nonrandom sequence.) Conditions such as $\|\Sigma_p\|_F^2 = o(p^2)$ are quite mild.
To provide intuition, recall that, if $\Sigma_p$ is a symmetric $p \times p$ matrix, and $\lambda_1, \ldots, \lambda_p$ are
the eigenvalues of $\Sigma_p$ and $\lambda_{\max} = \max(\lambda_1, \ldots, \lambda_p)$, then

$\|\Sigma_p\|_F^2 = \sum_{i=1}^p \lambda_i^2 \le p \lambda_{\max}^2$.

Hence the condition $\|\Sigma_p\|_F^2 = o(p^2)$ will be satisfied if $\lambda_{\max} = o(\sqrt{p})$.
To streamline the assumptions, we make the following definition. 

Definition 6.2. We will say that a sequence of p x p matrices {S p :p = 
1,2,...} is regular if p~ l tr(£ p ) — > a 2 and ||S p |||. = o(p 2 ). 
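
As a rough numerical illustration of Definition 6.2, the sketch below checks the two regularity conditions on a family of covariance matrices. The AR(1)-type matrix with entries $\rho^{|i-j|}$ and the helper names `ar1_covariance` and `regularity_summary` are assumptions chosen for illustration; they do not appear in the paper.

```python
import numpy as np

def ar1_covariance(p, rho=0.5):
    """Hypothetical example family: Sigma_p[i, j] = rho ** |i - j|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def regularity_summary(sigma):
    """Return (tr(Sigma)/p, ||Sigma||_F^2 / p^2, lambda_max / sqrt(p))."""
    p = sigma.shape[0]
    eigvals = np.linalg.eigvalsh(sigma)      # eigenvalues of a symmetric matrix
    frob_sq = np.sum(sigma ** 2)             # squared Frobenius norm
    # For a symmetric matrix the squared Frobenius norm is the sum of squared eigenvalues.
    assert np.isclose(frob_sq, np.sum(eigvals ** 2))
    return np.trace(sigma) / p, frob_sq / p ** 2, eigvals.max() / np.sqrt(p)

# Trace ratio stays at 1 while the scaled Frobenius norm shrinks as p grows.
summaries = {p: regularity_summary(ar1_covariance(p)) for p in (10, 100, 1000)}
for p, (trace_ratio, frob_ratio, lam_ratio) in summaries.items():
    print(p, trace_ratio, frob_ratio, lam_ratio)
```

For this family the largest eigenvalue stays bounded, so both the bound $\|\Sigma_p\|_F^2 \le p\lambda_{\max}^2$ and the condition $\lambda_{\max} = o(\sqrt{p})$ hold comfortably.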

We now state the main result of this section. 

Theorem 6.1. Suppose that $\beta_p$ is a $p \times d$ random matrix whose columns
are i.i.d. $\mathrm{Unif}(\mathbb{S}^p)$. Suppose, furthermore, that:

1. $\{\Sigma_{X_p}\}$, $\{\Sigma_{U_p}\}$ and $\{\Sigma_{U_p}\Sigma_{X_p}^{-1}\Sigma_{U_p}\}$ are regular sequences with $\sigma_X^2$, $\sigma_U^2$ and
$\sigma_V^2$ being the limits of their traces divided by $p$ as $p \to \infty$.

2. $\delta_p \perp\!\!\!\perp (X_p, Y)$ and $\beta_p \perp\!\!\!\perp (X_p, Y, \delta_p)$.

3. $p^{-1}M_p^T M_p = E(p^{-1}M_p^T M_p) + o_P(1)$.

If $Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \beta_p)$ and the conditional distribution of $Y \mid (\beta_p^T X_p = c, \beta_p)$
converges w.i.p. for each $c$, then $Y \perp\!\!\!\perp U_p \mid (\beta_p^T U_p, \beta_p)$ w.i.p. as $p \to \infty$.

If $Y \perp\!\!\!\perp U_p \mid (\beta_p^T U_p, \beta_p)$ and the conditional distribution of $Y \mid (\beta_p^T U_p = c, \beta_p)$
converges w.i.p. for each $c$, then $Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \beta_p)$ w.i.p. as $p \to \infty$.

The condition that "the conditional distribution of $Y \mid (\beta_p^T X_p = c, \beta_p)$
converges w.i.p. for each $c$" means that the regression relation stabilizes as
$p \to \infty$. Assumption 3 is parallel to the other of the two main assumptions
in Diaconis and Freedman [12], Assumption 1.1. This is also quite mild: for
example, it can be shown that if $X_p$ and $\delta_p$ are uniformly distributed on a
ball $\{x \in \mathbb{R}^p : \|x\| \le \rho\}$ and if the covariance matrices involved satisfy some
mild conditions, then assumption 3 is satisfied. For further discussion of this






assumption see Diaconis and Freedman [12], Section 3 — though it is given 
in the context of nonrandom data, parallel conclusions can be drawn in our 
context. 



Proof of Theorem 6.1. For simplicity, we will only consider the case
where $d = 1$; the proof of the general case is analogous and will be omitted. In
this case $\beta_p \sim \mathrm{Unif}(\mathbb{S}^p)$. Let $\zeta_p \sim \mathrm{Unif}(\mathbb{S}^p)$ and $\zeta_p \perp\!\!\!\perp (\beta_p, X_p, \delta_p, Y)$. Following
Diaconis and Freedman [12], we can equivalently assume $\beta_p \sim N(0, I_p/p)$ and
$\zeta_p \sim N(0, I_p/p)$, because these distributions converge to $\mathrm{Unif}(\mathbb{S}^p)$ as $p \to \infty$,
and thus induce the same asymptotic effect as $\mathrm{Unif}(\mathbb{S}^p)$. To summarize, we
equivalently assume

$\beta_p \sim N(0, I_p/p), \quad \zeta_p \sim N(0, I_p/p), \quad \beta_p \perp\!\!\!\perp \zeta_p, \quad (\beta_p, \zeta_p) \perp\!\!\!\perp (X_p, \delta_p, Y).$

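To prove the first assertion of the theorem, let

$U_{1,p} = \beta_p^T U_p, \quad U_{2,p} = \zeta_p^T U_p,$

$V_{1,p} = \beta_p^T\Sigma_{U_p}\Sigma_{X_p}^{-1}X_p, \quad V_{2,p} = \zeta_p^T\Sigma_{U_p}\Sigma_{X_p}^{-1}X_p, \quad V_{3,p} = (\sigma_U^2/\sigma_X^2)\,\beta_p^T X_p.$

Our goal is to show that, as $p \to \infty$,

$(Y, U_{1,p}, U_{2,p}, V_{1,p}, V_{2,p}, V_{3,p}) \mid (\beta_p, \zeta_p) \to (Y^*, U_1^*, U_2^*, V_1^*, V_2^*, V_3^*)$ w.i.p.,

where $Y^* \perp\!\!\!\perp U_2^* \mid U_1^*$.

Let $(\beta_p, \zeta_p) = \eta_p$ and $L_p = (U_{1,p}, U_{2,p}, V_{1,p}, V_{2,p}, V_{3,p})^T$. Then

$L_p = A(I_2 \otimes M_p^T)\,\mathrm{vec}(\eta_p),$
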

where A 
We first derive the w.i.p. limit of $(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p) \mid \eta_p$. Its (conditional)
characteristic function is

$\phi_p(t;\eta_p) = E\big(e^{it^T(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p)} \mid \eta_p\big)$, where $t \in \mathbb{R}^6$.

Because $\mathrm{vec}(\eta_p) \sim N(0, I_{2p}/p)$ and $\eta_p \perp\!\!\!\perp M_p$, we have

$E[\phi_p(t;\eta_p)] = E\big[E\big(e^{it^T(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p)} \mid M_p\big)\big] = E\big(e^{-(1/(2p))\|t^T(I_2 \otimes M_p)^T\|^2}\big).$

By assumption 3 and assumption 1,

(23) $p^{-1}M_p^T M_p = p^{-1}E(M_p^T M_p) + o_P(1) \xrightarrow{P} \Gamma.$

Thus, by the continuous mapping theorem (Billingsley [1], page 29),

(24) $e^{-(1/(2p))\|t^T(I_2 \otimes M_p)^T\|^2} \xrightarrow{P} e^{-(1/2)t^T(I_2 \otimes \Gamma)t}.$

Because the sequence $\{e^{-(1/(2p))\|t^T(I_2 \otimes M_p)^T\|^2}\}$ is bounded, we have

(25) $E\phi_p(t;\eta_p) = E\big[e^{-(1/(2p))\|t^T(I_2 \otimes M_p)^T\|^2}\big] \to e^{-(1/2)t^T(I_2 \otimes \Gamma)t}.$
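
The Gaussian characteristic-function step behind (25) can be checked numerically. The following is a minimal sketch, assuming a fixed stand-in matrix `M` with `k = 3` columns in place of $M_p$ and an arbitrary vector `t`; these values, the sample size and the seed are illustrative choices, not quantities from the paper. It verifies that $\|t^T(I_2 \otimes M)^T\|^2 = t^T(I_2 \otimes M^T M)t$ and that a Monte Carlo average of $e^{i t^T (I_2 \otimes M)^T z}$ with $z \sim N(0, I_{2p}/p)$ is close to $e^{-(1/(2p))\|t^T(I_2 \otimes M)^T\|^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 20, 3                               # k columns in the stand-in for M_p
M = rng.standard_normal((p, k))            # held fixed, i.e. conditioned on
t = 0.5 * rng.standard_normal(2 * k)       # argument of the characteristic function

a = np.kron(np.eye(2), M) @ t              # a = (I_2 kron M) t, a vector in R^{2p}
quad_direct = a @ a                        # ||t^T (I_2 kron M)^T||^2
quad_kron = t @ np.kron(np.eye(2), M.T @ M) @ t

# Rows play the role of vec(eta_p) ~ N(0, I_{2p} / p).
eta_vec = rng.standard_normal((100_000, 2 * p)) / np.sqrt(p)
mc_cf = np.mean(np.exp(1j * (eta_vec @ a)))       # Monte Carlo characteristic function
closed_form = np.exp(-quad_direct / (2 * p))      # Gaussian closed form

print(quad_direct, quad_kron)
print(mc_cf, closed_form)
```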
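In the meantime,

$E[|\phi_p(t;\eta_p)|^2] = E\big[E\big(e^{it^T(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p)} \mid \eta_p\big)\,E\big(e^{-it^T(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p)} \mid \eta_p\big)\big].$

If we let $\tilde M_p$ be a copy of $M_p$ such that $\tilde M_p \perp\!\!\!\perp (M_p, \eta_p)$, then the right-hand
side can be rewritten as

$E\big[E\big(e^{it^T[(I_2 \otimes M_p)^T - (I_2 \otimes \tilde M_p)^T]\mathrm{vec}(\eta_p)} \mid M_p, \tilde M_p\big)\big]$
$= E\big(e^{-(1/(2p))\|t^T[I_2 \otimes (M_p - \tilde M_p)]^T\|^2}\big).$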

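By Lemma 6.2 and convergence (23),

$p^{-1}(M_p - \tilde M_p)^T(M_p - \tilde M_p) = p^{-1}\big[M_p^T M_p - M_p^T\tilde M_p - \tilde M_p^T M_p + \tilde M_p^T\tilde M_p\big] \xrightarrow{P} 2\Gamma.$

Again, by the continuous mapping theorem,

$e^{-(1/(2p))\|t^T[I_2 \otimes (M_p - \tilde M_p)]^T\|^2} \xrightarrow{P} e^{-t^T(I_2 \otimes \Gamma)t}.$

Because the sequence $\{e^{-(1/(2p))\|t^T[I_2 \otimes (M_p - \tilde M_p)]^T\|^2}\}$ is bounded, we have

(26) $E|\phi_p(t;\eta_p)|^2 = E\big[e^{-(1/(2p))\|t^T[I_2 \otimes (M_p - \tilde M_p)]^T\|^2}\big] \to e^{-t^T(I_2 \otimes \Gamma)t}.$

By part 1 of Lemma 6.1, (25) and (26),

$(I_2 \otimes M_p)^T\mathrm{vec}(\eta_p) \mid \eta_p \to N(0, I_2 \otimes \Gamma)$ w.i.p. as $p \to \infty$.

Thus the w.i.p. limit of $L_p \mid \eta_p$ is $N(0, A(I_2 \otimes \Gamma)A^T)$. By calculation,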

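$\mathrm{cov}(U_1^*, U_2^*) = 0, \quad \mathrm{cov}(V_2^*, V_3^*) = 0,$

$\mathrm{cov}(V_1^* - V_3^*, V_3^*) = \mathrm{cov}(V_1^*, V_3^*) - \mathrm{cov}(V_3^*, V_3^*) = \dfrac{\sigma_U^4}{\sigma_X^2} - \dfrac{\sigma_U^4}{\sigma_X^2} = 0.$

Hence, by multivariate normality we have $U_1^* \perp\!\!\!\perp U_2^*$ and $V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$.
Also, recall from (15) that $U_p - \Sigma_{U_p}\Sigma_{X_p}^{-1}X_p = \Sigma_{X_pW_p}\Sigma_{W_p}^{-1}\delta_p$ and, by
assumption 2, $\delta_p \perp\!\!\!\perp (X_p, Y) \mid \eta_p$. Consequently,

$\eta_p^T U_p - \eta_p^T\Sigma_{U_p}\Sigma_{X_p}^{-1}X_p \perp\!\!\!\perp (X_p, Y) \mid \eta_p.$

By part 2 of Lemma 6.1, $U^* - V^* \perp\!\!\!\perp (V^*, Y)$. So condition 2 of Lemma 3.1
is satisfied.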
Let $L^*$ denote the w.i.p. limit of $L_p$. We now show that $(Y, L_p^T)^T \to
(Y^*, L^{*T})^T$ w.i.p. for some random variable $Y^*$ and

(27) $Y^* \perp\!\!\!\perp (V_1^*, V_2^*) \mid V_3^*.$

Since, given $\eta_p$, $L_p$ is a function of $X_p$ and $\delta_p$, we have

$E(e^{i\tau Y} \mid L_p, \eta_p) = E[E(e^{i\tau Y} \mid X_p, \delta_p, \eta_p) \mid L_p, \eta_p].$

Since $Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \beta_p)$ and $\zeta_p \perp\!\!\!\perp (X_p, \delta_p, Y, \beta_p)$, we have
$Y \perp\!\!\!\perp X_p \mid (\beta_p^T X_p, \eta_p)$. Also, since $\delta_p \perp\!\!\!\perp (X_p, Y, \eta_p)$, we have
$Y \perp\!\!\!\perp \delta_p \mid (X_p, \eta_p)$. Hence

$E(e^{i\tau Y} \mid X_p, \delta_p, \eta_p) = E(e^{i\tau Y} \mid X_p, \eta_p) = E(e^{i\tau Y} \mid V_{3,p}, \eta_p).$

Thus $E(e^{i\tau Y} \mid L_p, \eta_p) = E(e^{i\tau Y} \mid V_{3,p}, \beta_p)$, or equivalently

(28) $Y \perp\!\!\!\perp (L_p, \eta_p) \mid (V_{3,p}, \beta_p).$

It follows that the conditional distribution of $Y \mid L_p, \eta_p$ is the same as the
conditional distribution of $Y \mid (\beta_p^T X_p, \beta_p)$. Let $\mu(\cdot \mid c)$ be the w.i.p. limit of
$Y \mid (\beta_p^T X_p = c, \beta_p)$, and draw the random variable $Y^*$ from $\mu(\cdot \mid V_3^*)$. Then
$(Y, L_p) \mid \eta_p \to (Y^*, L^*)$ w.i.p. In the meantime (28) implies that
$Y \perp\!\!\!\perp L_p \mid (V_{3,p}, \eta_p)$. Hence, by part 2 of Lemma 6.1, $Y^* \perp\!\!\!\perp L^* \mid V_3^*$, which
implies (27). Thus all the conditions for assertion 1 of Lemma 3.1 are satisfied,
and the first conclusion of this theorem holds.

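The proof of the second assertion is similar, but this time let

$U_{1,p} = \beta_p^T\Sigma_{X_p}\Sigma_{U_p}^{-1}U_p, \quad U_{2,p} = \zeta_p^T\Sigma_{X_p}\Sigma_{U_p}^{-1}U_p, \quad U_{3,p} = (\sigma_X^2/\sigma_U^2)\,\beta_p^T U_p,$

$V_{1,p} = \beta_p^T X_p, \quad V_{2,p} = \zeta_p^T X_p$

and use the second part of Lemma 3.1. The details are omitted. □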

We now use a simulated example to demonstrate the approximate invariance
law.

Example 6.1. We again use the model in Example 3.1, but this time we
assume that the distribution of $X$ is nonnormal, specified by

X p = ZZ p <S>{\\Z p \\)/\\Z p \\, 

where $\Phi$ is the c.d.f. of the standard normal distribution and $Z_p \sim N(0, I_p)$.
Thus, conditioning on each line passing through the origin, $X_p$ is uniformly
distributed. Note that this is different from a uniform distribution over a
ball in $\mathbb{R}^p$, but it is sufficiently nonnormal to illustrate our point. Figure 2
presents the scatter plots of $Y$ versus $\beta_1^T X_p$, $\beta_2^T X_p$ and the scatter plots of
$Y$ versus $\beta_1^T U_p$, $\beta_2^T U_p$. We see that, even for $p$ as small as 6, $S_{Y|U_p}$ already
very much resembles $S_{Y|X_p}$ for nonnormal predictors. In fact, we can hardly
see any significant difference from Figure 1, where the invariance law holds
exactly.
Fig. 2. Comparison of the original and surrogate dimension reduction spaces for a
nonnormal predictor. Left panels: $Y$ versus $\beta_1^T X$ (upper) and $Y$ versus $\beta_2^T X$ (lower).
Right panels: $Y$ versus $\beta_1^T U$ (upper) and $Y$ versus $\beta_2^T U$ (lower).



Although we have only shown the asymptotic version of the invariance
law (3) for nonsingular $p \times p$ dimensional $\Gamma$, similar extensions can be made
for arbitrary $\Gamma$ (with $r \le p$), as well as for the invariance law (4). Because
the development is completely analogous they will be omitted. Also notice
that the assumptions for Theorem 6.1 impose no restriction on whether $X$
is a continuous random vector; thus the theorem also applies to discrete
predictors, so long as its conditions are satisfied.

7. Estimation procedure and convergence rate. Having established the
invariance laws at the population level, we now turn to the estimation
procedure and the associated convergence rate for surrogate dimension reduction.
Since we are no longer concerned with the limiting argument as $p \to \infty$, we
will drop the subscript $p$ that indicates dimension. Instead the subscripts in
$(X_i, Y_i)$ will now denote the $i$th case in the sample.

In the classical setting where measurement errors are absent, a dimension
reduction estimator usually takes the following form. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$
be an i.i.d. sample from $(X, Y)$. Let $F_{XY}$ be the distribution of $(X, Y)$, and
$F_{n,XY}$ be the empirical distribution based on the sample. Let $\mathcal{F}$ be the class
of all distributions of $(X, Y)$, and $\mathcal{G}$ be the class of all $p$ by $t$ matrices.
Let $T : \mathcal{F} \to \mathcal{G}$ be a mapping from $\mathcal{F}$ to $\mathcal{G}$. Most of the existing dimension
reduction methods, such as those described in the Introduction, have the
form of such a mapping $T$, so chosen that (i) $\mathrm{span}(T(F_{XY})) \subseteq S_{Y|X}$, and (ii)
$T(F_{n,XY}) = T(F_{XY}) + \Delta_n$, where $\Delta_n$ is $o_P(1)$ or $O_P(1/\sqrt{n})$ depending on the
estimators used. If these two conditions are satisfied with $\Delta_n = O_P(1/\sqrt{n})$
then we say that $T(F_{n,XY})$ is a $\sqrt{n}$-consistent estimator of $S_{Y|X}$; if, in
addition, (i) holds with equality then we say that $T(F_{n,XY})$ is a $\sqrt{n}$-exhaustive
estimator of $S_{Y|X}$. See Li, Zha and Chiaromonte [18].

The invariance law established in the previous sections tells us that we
can apply a classical dimension reduction method to the adjusted surrogate
predictor $U_1, \ldots, U_n$ and, if it satisfies (i) and (ii) for estimating $S_{Y|U}$ (or
$S_{E(Y^k|U)}$), then it also satisfies these properties for estimating $S_{Y|X}$ (or
$S_{E(Y^k|X)}$). Of course, here the adjusted surrogate predictor $U$ is not directly
observed, because it contains such unknown parameters as $\Sigma_{XW}$ and $\Sigma_W$.
However, these parameters can be estimated from an auxiliary sample that
contains the information about the relation between $X$ and $W$. As discussed
in Fuller [15] and Carroll and Li [3], in practice this auxiliary sample can be
available under one of several scenarios.

1. Representative validation sample. Under this scenario we observe, in
addition to $(W_1, Y_1), \ldots, (W_n, Y_n)$, a validation sample

(29) $(X_{-1}, W_{-1}), \ldots, (X_{-m}, W_{-m}).$

We can use this auxiliary sample to estimate $\Sigma_{XW}$ and $\Sigma_W$ by the method
of moments,

$\hat\Sigma_{XW} = E_m[(X - \bar X)(W - \bar W)^T], \quad \hat\Sigma_W = E_m[(W - \bar W)(W - \bar W)^T],$

where $E_m$ denotes averaging over the representative validation sample
(29). We can then use the estimates $\hat\Sigma_{XW}$ and $\hat\Sigma_W$ to adjust the surrogate
predictor $W_i$ as $\hat U_i = \hat\Sigma_{XW}\hat\Sigma_W^{-1}W_i$, and estimate $S_{Y|X}$ by $T(F_{n,m,\hat UY})$.
Here, $F_{n,m,\hat UY}$ is $F_{n,UY}$ with $U$ replaced by $\hat U$; we have added the subscript
$m$ to emphasize the dependence on $m$.

2. Representative replication sample. In this case we assume that $p = r$ and
that $\Gamma$ is known, which, without loss of generality, can be taken as $I_p$. We
have a separate sample where the true predictor $X_i$ is measured twice
with measurement error. That is,

(30) $W_{ij} = \gamma + X_i + \delta_{ij}, \quad j = 1, 2, \quad i = -1, \ldots, -m,$

where $\{\delta_{ij}\}$ are i.i.d. $N(0, \Sigma_\delta)$, $\{X_i\}$ are i.i.d. $N(0, \Sigma_X)$ and $\{\delta_{ij}\} \perp\!\!\!\perp \{X_i\}$.
From the replication sample we can extract information about
$\Sigma_{XW}$ and $\Sigma_W$ as follows. It is easy to see that, for $i = -1, \ldots, -m$,

$\mathrm{var}(W_{i1} - W_{i2}) = 2\Sigma_\delta, \quad \mathrm{var}(W_{i1} + W_{i2}) = 4\Sigma_X + 2\Sigma_\delta,$

from which we deduce

$\Sigma_\delta = \mathrm{var}(W_{i1} - W_{i2})/2,$

$\Sigma_W = \mathrm{var}(W_{i1} + W_{i2})/4 + \mathrm{var}(W_{i1} - W_{i2})/4.$

We can then estimate these two variance matrices by substituting in the
right-hand side the moment estimates of $\mathrm{var}(W_{i1} + W_{i2})$ and
$\mathrm{var}(W_{i1} - W_{i2})$ derived from the replication sample (30). Because in this case

$\Sigma_{XW}\Sigma_W^{-1} = I_p - \Sigma_\delta\Sigma_W^{-1},$

we adjust the surrogate predictor $W_i$ as $\hat U_i = (I_p - \hat\Sigma_\delta\hat\Sigma_W^{-1})W_i$.
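
The replication-sample adjustment can be sketched in a few lines. The code below is an illustration under stated assumptions: the function name `replication_adjustment`, the simulated covariance matrices, the sample size and the seed are all hypothetical choices made here, and $\Gamma = I_p$ as in scheme 2. It forms the moment estimates of $\Sigma_\delta$ and $\Sigma_W$ from the sums and differences of the two replicates and compares the estimated correction matrix with $\Sigma_X(\Sigma_X + \Sigma_\delta)^{-1}$.

```python
import numpy as np

def replication_adjustment(w1, w2):
    """Moment estimates from two replicate measurements (rows = cases).

    Returns (Sigma_delta_hat, Sigma_W_hat, I_p - Sigma_delta_hat Sigma_W_hat^{-1}).
    """
    diff_cov = np.cov(w1 - w2, rowvar=False)      # estimates 2 * Sigma_delta
    sum_cov = np.cov(w1 + w2, rowvar=False)       # estimates 4 * Sigma_X + 2 * Sigma_delta
    sigma_delta = diff_cov / 2.0
    sigma_w = sum_cov / 4.0 + diff_cov / 4.0      # Sigma_X + Sigma_delta
    correction = np.eye(w1.shape[1]) - sigma_delta @ np.linalg.inv(sigma_w)
    return sigma_delta, sigma_w, correction

# Simulated replication sample (all values below are illustrative assumptions).
rng = np.random.default_rng(1)
m, p = 20_000, 3
sigma_x = np.array([[1.0, 0.3, 0.0],
                    [0.3, 1.0, 0.2],
                    [0.0, 0.2, 1.0]])
sigma_d = 0.25 * np.eye(p)
x = rng.multivariate_normal(np.zeros(p), sigma_x, size=m)
w1 = x + rng.multivariate_normal(np.zeros(p), sigma_d, size=m)
w2 = x + rng.multivariate_normal(np.zeros(p), sigma_d, size=m)

sd_hat, sw_hat, corr_hat = replication_adjustment(w1, w2)
corr_true = sigma_x @ np.linalg.inv(sigma_x + sigma_d)
print(np.abs(corr_hat - corr_true).max())
```

Given a primary sample stored row-wise in an array `w`, the adjusted predictors would then be `w @ corr_hat.T`, one row per $\hat U_i$.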

A variation of the second sampling scheme appears in Fuller [15], which will 
be further discussed in Section 9 in conjunction with the analysis of a data 
set concerning managerial behavior. 

Under the above schemes $\Sigma_{XW}$ and $\Sigma_W$ can be estimated by $\hat\Sigma_{XW}$ and
$\hat\Sigma_W$ at the $\sqrt{m}$-rate, as can be easily derived from the central limit theorem.
Hence $F_{n,m,\hat UY} = F_{n,UY} + O_P(1/\sqrt{m})$. Meanwhile, by the central limit
theorem, we have $F_{n,UY} = F_{UY} + O_P(1/\sqrt{n})$. The dimension reduction estimator
$T(F_{n,m,\hat UY})$ can be decomposed as

(31) $T(F_{n,m,\hat UY}) = T(F_{UY}) + [T(F_{n,m,\hat UY}) - T(F_{n,UY})] + [T(F_{n,UY}) - T(F_{UY})].$

Usually the mapping $T$ is sufficiently smooth so that the second term is of
order $O_P(1/\sqrt{m})$ and the third term is of order $O_P(1/\sqrt{n})$. That is,

(32) $T(F_{n,m,\hat UY}) = T(F_{UY}) + O_P(1/\sqrt{m}) + O_P(1/\sqrt{n}).$

While it is possible to impose a general smoothness condition on T for the 
above relation in terms of Hadamard differentiability (Fernholz [14]), it is 
often simpler to verify (32) directly on an individual basis. The next exam- 
ple will illustrate how this can be done for a specific dimension reduction 
estimator. 
Example 7.1. Li, Zha and Chiaromonte [18] introduced contour regression
(CR), which can be briefly described as follows. Let

(33) $H_X(c) = E[(\tilde X - X)(\tilde X - X)^T I(|\tilde Y - Y| \le c)],$

in which $(X, Y)$ and $(\tilde X, \tilde Y)$ are independent random pairs distributed as
$F_{XY}$, and $c > 0$ is a cutting point roughly corresponding to the width of a
slice in SIR or SAVE. Let $v_{p-q+1}, \ldots, v_p$ be the eigenvectors of
$\Sigma_X^{-1/2}H_X\Sigma_X^{-1/2}$ corresponding to its $q$ smallest eigenvalues. Then, under
reasonably mild assumptions,

(34) $\mathrm{span}(\Sigma_X^{-1/2}v_p, \ldots, \Sigma_X^{-1/2}v_{p-q+1}) = S_{Y|X}.$

Thus, the mapping $T$ in this special case is defined by

$T(F_{XY}) = (\Sigma_X^{-1/2}v_p, \ldots, \Sigma_X^{-1/2}v_{p-q+1}).$

For estimation we replace $H_X$ and $\Sigma_X$ by their sample estimators,

(35) $\hat H_X(c) = \binom{n}{2}^{-1}\sum_{(i,j) \in N}\big[(X_j - X_i)(X_j - X_i)^T I(|Y_j - Y_i| \le c)\big], \quad \hat\Sigma_X = E_n[(X - \bar X)(X - \bar X)^T],$

where, in the first equality, $N$ is the index set $\{(i,j) : 1 \le j < i \le n\}$.

The motivation for introducing contour regression is to overcome the dif- 
ficulties encountered by the classical methods. It is well known that if the 
relation Y\X is symmetric about the origin then the population version of 
the SIR matrix is 0, and if E(Y\X) is linear in X then the population ver- 
sion of the pHd matrix is zero. In these cases, or in situations close to these 
cases, or in a combination of these cases, SIR and pHd tend not to yield 
accurate estimation of the dimension reduction space. Contour regression 
does not have this drawback because of the property (34). 
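
The sample version of contour regression in (35) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `contour_regression`, the toy single-index model and all numerical settings are assumptions made here. It forms $\hat H_X(c)$ over all pairs, standardizes by $\hat\Sigma_X^{-1/2}$, and returns the directions attached to the $q$ smallest eigenvalues.

```python
import numpy as np

def contour_regression(x, y, c, q):
    """Return a p x q matrix of estimated directions following (33)-(35)."""
    n, p = x.shape
    i_idx, j_idx = np.triu_indices(n, k=1)            # all n-choose-2 pairs
    keep = np.abs(y[i_idx] - y[j_idx]) <= c           # pairs inside a contour band
    diffs = x[i_idx[keep]] - x[j_idx[keep]]
    h_hat = diffs.T @ diffs / len(i_idx)              # average over all pairs
    sigma_hat = np.cov(x, rowvar=False, bias=True)    # moment estimate of Sigma_X
    evals, evecs = np.linalg.eigh(sigma_hat)
    sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    kernel = sigma_inv_half @ h_hat @ sigma_inv_half
    _, k_evecs = np.linalg.eigh(kernel)               # eigenvalues in ascending order
    return sigma_inv_half @ k_evecs[:, :q]            # q smallest eigenvalues

# Toy single-index model (illustrative): Y depends on X only through its first coordinate.
rng = np.random.default_rng(2)
x = rng.standard_normal((300, 4))
y = x[:, 0] + 0.1 * rng.standard_normal(300)
b_hat = contour_regression(x, y, c=0.2, q=1)[:, 0]
b_hat = b_hat / np.linalg.norm(b_hat)
print(b_hat)
```

Within a contour band the predictor differences vary little along the central subspace, which is why the smallest eigenvalues are the ones retained.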

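In the context of measurement error regression the true predictor $X_i$ is
to be replaced by $\hat U_i$. For illustration, we adopt the first sampling scheme
described above. Let $\hat\Sigma_{W1}$ and $\hat\Sigma_{W2}$ be the sample estimates of $\Sigma_W$ based on
the primary sample $\{W_1, \ldots, W_n\}$ and the auxiliary sample $\{W_{-1}, \ldots, W_{-m}\}$,
respectively. Let $\hat H_{\hat U}(c)$ be the matrix $\hat H_X(c)$ in (35) with $X_i, X_j$ replaced
by $\hat U_i, \hat U_j$. Then, it can be easily seen that

(36) $\hat H_{\hat U}(c) = \hat\Sigma_{XW}\hat\Sigma_{W2}^{-1}\hat H_W(c)\hat\Sigma_{W2}^{-1}\hat\Sigma_{WX}, \quad \hat\Sigma_{\hat U} = \hat\Sigma_{XW}\hat\Sigma_{W2}^{-1}\hat\Sigma_{W1}\hat\Sigma_{W2}^{-1}\hat\Sigma_{WX},$

where in the first equality $\hat H_W(c)$ is $\hat H_X(c)$ in (35) with $X_i, X_j$ replaced by
$W_i, W_j$. Because $\hat\Sigma_{XW}$ and $\hat\Sigma_{W2}$ are based on the auxiliary sample, they
approximate $\Sigma_{XW}$ and $\Sigma_W$ at the $\sqrt{m}$-rate. Because $\hat\Sigma_{W1}$ is based on the
primary sample, it estimates $\Sigma_W$ at the $\sqrt{n}$-rate. Consequently,
$\hat\Sigma_{\hat U} = \Sigma_U + O_P(1/\sqrt{m}) + O_P(1/\sqrt{n})$, which implies

$\hat\Sigma_{\hat U}^{-1/2} = \Sigma_U^{-1/2} + O_P(1/\sqrt{m}) + O_P(1/\sqrt{n}).$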

In the meantime, it can be shown using the central limit theorem for
U-statistics (see Li, Zha and Chiaromonte [18]) that $\hat H_W(c) = H_W(c) + O_P(1/\sqrt{n})$,
where $H_W(c)$ is $H_X(c)$ in (33) with $X, \tilde X$ replaced by $W, \tilde W$. Hence

(37) $\hat H_{\hat U}(c) = H_U(c) + O_P(1/\sqrt{n}) + O_P(1/\sqrt{m}).$

Combining (36) and (37) we see that

$\hat\Sigma_{\hat U}^{-1/2}\hat H_{\hat U}\hat\Sigma_{\hat U}^{-1/2} = \Sigma_U^{-1/2}H_U\Sigma_U^{-1/2} + O_P(1/\sqrt{m}) + O_P(1/\sqrt{n}).$

It follows that the eigenvectors $\hat v_{p-q+1}, \ldots, \hat v_p$ of the matrix on the
left-hand side converge to the corresponding eigenvectors of the matrix on the
right-hand side, $v_{p-q+1}, \ldots, v_p$, at the rate of $O_P(1/\sqrt{n}) + O_P(1/\sqrt{m})$, and
consequently,

$\hat\Sigma_{\hat U}^{-1/2}\hat v_i = \Sigma_U^{-1/2}v_i + O_P(1/\sqrt{m}) + O_P(1/\sqrt{n}).$

Thus we have verified the convergence rate expressed in (32).

It is possible to use the general argument above to carry out asymptotic 
analysis for a surrogate dimension reduction estimator, in which both the 
rates according to m and n are involved. This can then be used to construct 
test statistics. But because of limited space this will be developed in future 
research. Special cases are available for SIR and OLS (Carroll and Li [3]) 
and for pHd (Lue [22]). 

8. Simulation studies. As already mentioned, a practical impact of the 
invariance law is that it makes all dimension reduction methods accessible 
to the measurement error problem, thereby allowing us to tackle the difficult 
situations that the classical methods do not handle well. We now demon- 
strate this point by applying SIR, pHd and CR to simulated samples from 
the same model and comparing their performances. 

We still use the model in Example 3.1, but this time take $\Gamma = I_p$. Under
this model $\Sigma_W = \Sigma_X + \Sigma_\delta$ and $\Sigma_{XW} = \Sigma_X$. We take an auxiliary sample of
size $m = 100$. The standard deviations $\sigma_\delta$ and $\sigma_\varepsilon$ are taken to be 0.2, 0.4,
0.6. For the auxiliary sample, we simulate $\{W_{ij} : j = 1, 2, i = -1, \ldots, -m\}$
according to the representative replication scheme described in Section 7.
Table 1

Comparison of different surrogate dimension reduction methods

  $\sigma_\varepsilon$   $\sigma_\delta$   SIR            pHd            CR
  0.2    0.2    1.26 ± 0.63    1.35 ± 0.56    0.12 ± 0.07
  0.2    0.4    1.33 ± 0.58    1.56 ± 0.51    0.16 ± 0.09
  0.2    0.6    1.46 ± 0.53    1.57 ± 0.46    0.32 ± 0.21
  0.4    0.2    1.44 ± 0.57    1.41 ± 0.51    0.12 ± 0.08
  0.4    0.4    1.34 ± 0.57    1.5 ± 0.5      0.20 ± 0.13
  0.4    0.6    1.36 ± 0.59    1.62 ± 0.48    0.34 ± 0.19
  0.6    0.2    1.36 ± 0.62    1.50 ± 0.55    0.13 ± 0.08
  0.6    0.4    1.44 ± 0.53    1.53 ± 0.49    0.21 ± 0.19
  0.6    0.6    1.48 ± 0.53    1.70 ± 0.45    0.32 ± 0.18


For each of the nine combinations of the values of $\sigma_\varepsilon$ and $\sigma_\delta$, 100 samples of
$\{(X_i, Y_i)\}$ and $\{W_{ij}\}$ are generated according to the above specifications.

The estimation error of a dimension reduction method is measured by the
following distance between two subspaces of $\mathbb{R}^p$. Let $S_1$ and $S_2$ be subspaces
of $\mathbb{R}^p$, and $P_1$ and $P_2$ be the projections onto $S_1$ and $S_2$ with respect to the
usual inner product $\langle a, b \rangle = a^T b$. We define the (squared) distance between
$S_1$ and $S_2$ as

$\rho(S_1, S_2) = \|P_1 - P_2\|^2,$

where $\|\cdot\|$ is the Euclidean matrix norm. The same distance was used in
Li, Zha and Chiaromonte [18], which is similar to the distance used in Xia
et al. [25].
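
The distance $\rho(S_1, S_2)$ can be computed directly from basis matrices. The sketch below is illustrative; the function names `projection` and `subspace_distance` are introduced here, and the "Euclidean matrix norm" is read as the Frobenius norm. That reading is an assumption, though it is consistent with Table 1, where reported distances exceed 1, which a squared spectral norm of a difference of projections could not do.

```python
import numpy as np

def projection(basis):
    """Orthogonal projection onto the column space of `basis` (full column rank assumed)."""
    q, _ = np.linalg.qr(basis)        # orthonormal basis of the column space
    return q @ q.T

def subspace_distance(basis1, basis2):
    """Squared Frobenius distance between the two projection matrices."""
    return np.linalg.norm(projection(basis1) - projection(basis2), ord="fro") ** 2

# Illustrative checks in R^3.
e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
plane = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
mixed = plane @ np.array([[2.0, 1.0], [1.0, 3.0]])   # same span, different basis

d_orthogonal = subspace_distance(e1, e2)
d_same_span = subspace_distance(plane, mixed)
print(d_orthogonal, d_same_span)
```

Because the distance depends only on the projections, it is invariant to the choice of basis, which is what makes it suitable for comparing estimated and true dimension reduction spaces.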

For SIR, the response is partitioned into eight slices, each having 50
observations. For CR, the cutting point $c$ is taken to be 0.5, which roughly
amounts to including 12% of the $\binom{400}{2} = 79800$ vectors $U_i - U_j$ corresponding
to the lowest increments in the response. The results are presented in
Table 1. The symbol $a \pm b$ in the last three columns stands for mean and
standard error of the distances $\rho(\hat S_{Y|X}, S_{Y|X})$ over the 100 simulated samples.
We can see that CR achieves significant improvement over SIR and
pHd across all the combinations of $\sigma_\delta$ and $\sigma_\varepsilon$. This is because the regression
model (16) contains a symmetric component in the $\beta_1$ direction, which SIR
cannot estimate accurately, and a roughly monotone component in the $\beta_2$
direction, which pHd cannot estimate accurately. In contrast, CR accurately
captures both components.

To provide further insights, we use one simulated sample to compare
the different estimators. Figure 3 compares the performance of
SIR, pHd and CR in estimating $S_{Y|U}$ (or $S_{Y|X}$). We see that SIR gives a
good estimate for $\beta_2$ but a poor estimate for $\beta_1$, the opposite of the case for
pHd, but CR performs very well in estimating both components.
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Fig. 3. Surrogate dimension reduction by SIR, pHd and CR. Left panels: Y versus the 
second (upper) and the first (lower) predictors by SIR. Middle panels: Y versus first (upper) 
and second (lower) predictors by pHd. Right panels: the second (upper) and the first (lower) 
predictors by CR. 



9. Analysis of a managerial role performance data. In this section we
apply surrogate dimension reduction to a role performance data set studied in
Warren, White and Fuller [24] and Fuller [15]. To study managerial behavior,
$n = 98$ managers of Iowa farmer cooperatives were randomly sampled. The
response is the role performance of a manager. There are four predictors:
$X_1$ (knowledge) is the knowledge of the economic phases of management
directed toward profit-making, $X_2$ (value orientation) is the tendency to
rationally evaluate means to an economic end, $X_3$ (role satisfaction) is the
gratification obtained, and $X_4$ (training) is the amount of formal education.
The predictors $X_1$, $X_2$ and $X_3$, and the response $Y$ are measured with
questionnaires filled out by the managers, and contain measurement errors. The
amount of formal education, $X_4$, is measured without error. Through dimension
reduction of this data set we wish to see if the linear assumption in Fuller [15]
is reasonable, if there are linear combinations of the predictors other than those
obtained from the linear model that significantly affect the role performance,
and how different surrogate dimension reduction methods perform and compare
in a real data set.

The sampling scheme is a variation of the second one described in Section
7. A split halves design of the questionnaires yields two observations on each
item with measurement error for each manager, say

$(V_{i1}, V_{i2}), (W_{i11}, W_{i12}), \ldots, (W_{i31}, W_{i32}), \quad i = 1, \ldots, n,$

where $(V_{i1}, V_{i2})$ are the two measurements of $Y_i$ and $(W_{ij1}, W_{ij2})$, $j = 1, 2, 3$,
are the two measurements of $X_{ij}$. The surrogate predictors for $X_{ij}$, $j = 1, 2, 3$,
are taken to be the averages $W_{ij} = (W_{ij1} + W_{ij2})/2$. Similarly, the
surrogate response for $Y_i$ is taken to be $V_i = (V_{i1} + V_{i2})/2$. Since the
measurement error in $Y$ does not change the regression model, we can treat $V$
as the true response $Y$. As in Fuller [15], we assume:

1. $V_{ik} = Y_i + \varepsilon_{ik}$, $W_{ijk} = X_{ij} + \eta_{ijk}$, $i = 1, \ldots, n$, $j = 1, 2, 3$, $k = 1, 2$, where

$\{(\varepsilon_{ik}, \eta_{i1k}, \eta_{i2k}, \eta_{i3k}) : i = 1, \ldots, n, k = 1, 2\} \perp\!\!\!\perp \{(Y_i, X_{i1}, \ldots, X_{i4}) : i = 1, \ldots, n\}.$

2. The random vectors $\{(\varepsilon_{ik}, \eta_{i1k}, \eta_{i2k}, \eta_{i3k}) : i = 1, \ldots, n, k = 1, 2\}$ are i.i.d.
4-dimensional normal with mean 0 and variance matrix

$\mathrm{diag}(\sigma_\varepsilon^2, \sigma_{\eta 1}^2, \sigma_{\eta 2}^2, \sigma_{\eta 3}^2).$

3. $\{(X_{i1}, \ldots, X_{i4}) : i = 1, \ldots, n\}$ are i.i.d. $N(\mu_X, \Sigma_X)$.

From these assumptions it is easy to see that, for $j = 1, \ldots, 4$ and $i = 1, \ldots, n$,
$W_{ij} = X_{ij} + \delta_{ij}$, where

$\{(\delta_{i1}, \ldots, \delta_{i4}) : i = 1, \ldots, n\} \perp\!\!\!\perp \{(X_{i1}, \ldots, X_{i4}, V_i) : i = 1, \ldots, n\}$

and $\{(\delta_{i1}, \ldots, \delta_{i4}) : i = 1, \ldots, n\}$ are i.i.d. normal with mean 0 and variance matrix

(38) $\Sigma_\delta = \mathrm{diag}(\sigma_{\delta 1}^2, \sigma_{\delta 2}^2, \sigma_{\delta 3}^2, 0), \quad \sigma_{\delta j}^2 = \tfrac{1}{4}\mathrm{var}(W_{ij1} - W_{ij2}), \quad j = 1, 2, 3.$

Thus, at the population level, our measurement error model can be
summarized as

$W = X + \delta, \quad \delta \perp\!\!\!\perp (X, V), \quad \delta \sim N(0, \Sigma_\delta), \quad X \sim N(\mu_X, \Sigma_X),$

where $\Sigma_\delta$ is given by (38). Note that, unlike in Fuller [15], no assumption is
imposed on the relation between $Y$ and $X$.

The measurement error variance $\Sigma_\delta$ is estimated using the moment
estimator of (38) based on the sample $\{(W_{ij1}, W_{ij2}) : i = 1, \ldots, 98, j = 1, \ldots, 3\}$,
as

$\hat\Sigma_\delta = \mathrm{diag}(0.0203, 0.0438, 0.0180, 0).$

The variance matrix $\Sigma_W$ of $W_i = (W_{i1}, \ldots, W_{i4})^T$ is estimated from the
sample $\{W_i : i = 1, \ldots, 98\}$ as

$\hat\Sigma_W = \begin{pmatrix} 0.0520 & 0.0280 & 0.0044 & 0.0192 \\ 0.0280 & 0.1212 & -0.0063 & 0.0353 \\ 0.0044 & -0.0063 & 0.0901 & -0.0066 \\ 0.0192 & 0.0353 & -0.0066 & 0.0946 \end{pmatrix}.$
Fig. 4. Scatter plots of role performance versus the first predictor by SIR (left) and
versus the first predictor by CR (right).

The correction coefficients $\Sigma_{XW}\Sigma_W^{-1}$ are then estimated by $I_4 - \hat\Sigma_\delta\hat\Sigma_W^{-1}$, and
the surrogate predictor $W_i$ is corrected as $\hat U_i = (I_4 - \hat\Sigma_\delta\hat\Sigma_W^{-1})W_i$.
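
The correction matrix can be reproduced from the two estimates reported above. The short sketch below uses only those reported numbers; the variable names are introduced here for illustration, and no output values are quoted because the paper does not report the resulting matrix.

```python
import numpy as np

# Reported moment estimates for the role performance data.
sigma_delta_hat = np.diag([0.0203, 0.0438, 0.0180, 0.0])
sigma_w_hat = np.array([
    [0.0520,  0.0280,  0.0044,  0.0192],
    [0.0280,  0.1212, -0.0063,  0.0353],
    [0.0044, -0.0063,  0.0901, -0.0066],
    [0.0192,  0.0353, -0.0066,  0.0946],
])

# Estimated correction coefficients I_4 - Sigma_delta_hat Sigma_W_hat^{-1}.
correction = np.eye(4) - sigma_delta_hat @ np.linalg.inv(sigma_w_hat)
print(np.round(correction, 3))

# Adjusting one surrogate observation w (a length-4 vector) would read: u_hat = correction @ w
```

Since the fourth predictor (training) is measured without error, the last row of the correction matrix is the fourth unit vector, so that coordinate passes through unchanged.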

We apply SIR and contour regression to the surrogate regression problem
of $V_i$ versus $\hat U_i$. As in Fuller [15], a portion of the data (55 out of 98 subjects)
will be presented. For SIR, the responses of 55 subjects are divided into five
slices, each having 11 subjects. For CR, the cutting point $c$ is taken to be 0.1,
which amounts to including 552 of the $\binom{55}{2} = 1485$ (roughly 37%) differences
$\hat U_i - \hat U_j$ corresponding to the lowest increments $|Y_i - Y_j|$ of the response. In
fact, varying the number of slices for SIR or the cutting point $c$ for CR
within a reasonable range does not seem to have a serious effect on their
performance.

Figure 4 shows the scatter plots of $Y$ versus the first predictors calculated
by SIR (left) and CR (right). None of the scatter plots of $Y$ versus the second
predictors by SIR and CR shows any discernible pattern, and so they are
not presented. Because there is no U-shaped component in the data, the
accuracies of SIR and CR are comparable. These plots show that the linear
model postulated in Warren, White and Fuller [24] and Fuller [15] does
fit this data set, and there do not appear to be other linear combinations of
the predictors that significantly affect the role performance. The directions
obtained from CR, SIR, and that using the maximum likelihood estimator
for a linear model, are presented in Table 2 (the vectors are rescaled to have
length 1).

Note that SIR, as applied to the adjusted surrogate predictor U, is the 
estimator proposed in Carroll and Li [3]. We can see that for this data set the 
three methods are more or less consistent, though CR gives more weight to 
past education than the other methods. Of course, the significance of these 
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parameters should be determined by a formal test based on the relevant 
asymptotic distributions. Such asymptotic results are available for pHd and 
SIR, and are under development for CR. 

APPENDIX 

Proof of Lemma 5.1. Note that the conditions for equality (9) are
still satisfied. Multiply both sides of (9) by $h(Y)$ and then take expectation,
to obtain

$E[h(Y)e^{it^T U^*}]\,e^{(1/2)t^T\Sigma_{U^*}t} = E[h(Y)e^{it^T V^*}]\,e^{(1/2)t^T\Sigma_{V^*}t}.$

To prove the first assertion, suppose there is a $V_3^*$ such that
$V_3^* \perp\!\!\!\perp (V_1^* - V_3^*, V_2^*)$ and $E[h(Y) \mid V^*, V_3^*] = E[h(Y) \mid V_3^*]$. Then

$E(h(Y)e^{it^T V^*}) = E\big(E[h(Y) \mid V^*, V_3^*]\,e^{it_1^T(V_1^* - V_3^*)}e^{it_2^T V_2^*}e^{it_1^T V_3^*}\big)$

$= E\big(E[h(Y) \mid V_3^*]\,e^{it_1^T(V_1^* - V_3^*)}e^{it_2^T V_2^*}e^{it_1^T V_3^*}\big)$

$= E\big(e^{it_1^T(V_1^* - V_3^*)}e^{it_2^T V_2^*}\big)\,E\big(h(Y)e^{it_1^T V_3^*}\big).$

Follow the steps that lead to (13) in the proof of Lemma 3.1 to obtain

$E(h(Y)e^{it^T U^*}) = E(h(Y)e^{it_1^T U_1^*})\,e^{-(1/2)t_2^T\Sigma_{U_2^*}t_2}.$

The equation can be rewritten as

$E(E[h(Y) \mid U^*]e^{it^T U^*}) = E(E[h(Y) \mid U_1^*]e^{it_1^T U_1^*})\,e^{-(1/2)t_2^T\Sigma_{U_2^*}t_2}.$

Because $U_1^* \perp\!\!\!\perp U_2^*$, the right-hand side is

$E(E[h(Y) \mid U_1^*]e^{it_1^T U_1^* + it_2^T U_2^*}) = E(E[h(Y) \mid U_1^*]e^{it^T U^*}).$

It follows that

$E(\{E[h(Y) \mid U^*] - E[h(Y) \mid U_1^*]\}e^{it^T U^*}) = 0$

for all $t$. In other words, the Fourier transform of $E[h(Y) \mid U^*] - E[h(Y) \mid U_1^*]$
is zero. Thus $E[h(Y) \mid U^*] - E[h(Y) \mid U_1^*] = 0$ almost surely.
The proof of the second assertion will be omitted. □

Table 2

  Method                  $X_1$     $X_2$     $X_3$     $X_4$
  Fuller [15]             0.881     0.365     0.286     0.098
  SIR (Carroll and Li)    0.952     0.219     0.187     0.102
  CR                      0.935     0.291     0.126     0.159





Proof of Lemma 6.1. 1. That (22) implies $\phi_p(t;\beta_p) \xrightarrow{P} \omega(t)$ is well
known. Now suppose $\phi_p(t;\beta_p) \xrightarrow{P} \omega(t)$. Then $|\phi_p(t;\beta_p)|^2 \xrightarrow{P} |\omega(t)|^2$. Because
both $\phi_p(t;\beta_p)$ and $|\phi_p(t;\beta_p)|^2$ are bounded, (22) holds.

2. Because $R_p \perp\!\!\!\perp S_p \mid (T_p, \beta_p)$, we have

$E(e^{it^T R^*}e^{iu^T S^*} \mid T^*)$

$= E(e^{it^T R^*} \mid T^*)\,E(e^{iu^T S^*} \mid T^*)$

$+ [E(e^{it^T R_p} \mid T_p, \beta_p)\,E(e^{iu^T S_p} \mid T_p, \beta_p) - E(e^{it^T R^*} \mid T^*)\,E(e^{iu^T S^*} \mid T^*)]$

$+ [E(e^{it^T R^*}e^{iu^T S^*} \mid T^*) - E(e^{it^T R_p}e^{iu^T S_p} \mid T_p, \beta_p)].$

Because $(R_p, S_p, T_p) \mid \beta_p \to (R^*, S^*, T^*)$ w.i.p., the last two terms on the
right-hand side are $o_P(1)$. Hence the left-hand side equals the first term on the
right-hand side because the former is a nonrandom quantity independent of
$p$. □

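Proof of Lemma 6.2. It suffices to show that, if $A_p$ and $B_p$ are
regular sequences of random vectors in $\mathbb{R}^p$ such that $A_p \perp\!\!\!\perp B_p$ and $E(A_p) = 0$,
$E(B_p) = 0$, then $p^{-1}A_p^T B_p = o_P(1)$. If this is true then we can take
$A_p = X_p$, $U_p$, or $\Sigma_{U_p}\Sigma_{X_p}^{-1}X_p$ and $B_p = \tilde X_p$, $\tilde U_p$, or $\Sigma_{U_p}\Sigma_{X_p}^{-1}\tilde X_p$
to prove the desired equality.

By Chebyshev's inequality,

(39) $P(p^{-1}|A_p^T B_p| > \varepsilon) \le \dfrac{1}{\varepsilon^2 p^2}E(A_p^T B_p)^2.$

Writing $A_{p,i}$ and $B_{p,i}$ for the $i$th components of $A_p$ and $B_p$, the expectation
on the right-hand side is

$E\Big(\sum_{i=1}^{p} A_{p,i}B_{p,i}\Big)^2 = \sum_{i=1}^{p}\sum_{j=1}^{p} E(A_{p,i}A_{p,j})\,E(B_{p,i}B_{p,j})$

$\le \Big[\sum_{i=1}^{p}\sum_{j=1}^{p}\{E(A_{p,i}A_{p,j})\}^2\Big]^{1/2}\Big[\sum_{i=1}^{p}\sum_{j=1}^{p}\{E(B_{p,i}B_{p,j})\}^2\Big]^{1/2} = \|\Sigma_A\|_F\,\|\Sigma_B\|_F,$

where the inequality is from the Cauchy-Schwarz inequality. By assumption,
$\|\Sigma_A\|_F = o(p)$ and $\|\Sigma_B\|_F = o(p)$. So the right-hand side of (39) converges to
0 as $p \to \infty$. Hence $p^{-1}A_p^T B_p = o_P(1)$. □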
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